Abstract

This paper studies matrix completion under a general sampling model using the
max-norm as a convex relaxation for the rank of the matrix. The optimal rate of
convergence is established for the Frobenius norm loss. It is shown that the max-norm
constrained minimization method is rate-optimal and that it yields a more stable
approximate recovery guarantee, with respect to the sampling distributions, than
previously used trace-norm based approaches. The computational effectiveness of this
method is also studied, based on a first-order algorithm for solving convex programs
involving a max-norm constraint.
1 Introduction



The problem of recovering a low-rank matrix from a subset of its entries, also known as
matrix completion, has been an active topic of recent research with a range of applications
including collaborative filtering (the Netflix problem) [15], multi-task learning [1], system
identification [29], and sensor localization [3, 38, 39], among many others. We refer to [6]
for a discussion of the above-mentioned applications. Another noteworthy example is the
structure-from-motion problem [10, 43] in computer vision. Let $f$ and $d$ be the number
of frames and feature points respectively. The data are stacked into a low-rank matrix
of trajectories, say $M \in \mathbb{R}^{2f \times d}$, such that every element of $M$
corresponds to an image coordinate from a feature point of a rigid moving object at a given
frame. Due to object occlusions, tracking errors, or variables out of range (i.e. images
beyond the camera field of view), missing data are inevitable in real-life applications and
are represented as empty entries in the matrix. Therefore, accurate and effective matrix
completion methods are required, which fill in missing entries by suitable estimates.

As a direct search for the lowest-rank matrix satisfying the equality constraints is
NP-hard [5], most previous work on matrix completion has focused on using the trace-norm,
which is defined to be the sum of the singular values of the matrix, as a convex relaxation
for the rank. This can be viewed as an analog to relaxing the sparsity of a vector to
its $\ell_1$-norm, which has been shown to be effective both empirically and theoretically in
compressed sensing. Several recent papers proved in different settings that a generic
$d \times d$ rank-$r$ matrix can be exactly and efficiently recovered from
$O(rd\,\mathrm{polylog}(d))$ randomly chosen entries [8, 9, 17, 35]. These results thus
provide theoretical guarantees for the trace-norm constrained minimization method. In the
case of recovering approximately low-rank matrices based on noisy observations, different
types of trace-norm based estimators, which are akin to the Lasso and Dantzig selector used
in sparse signal recovery, were proposed and well-studied [6, 20, 36, 24, 32, 23, 22, 21].

It is, however, unclear whether the trace-norm is the best convex relaxation for the rank.
A matrix $M \in \mathbb{R}^{d_1 \times d_2}$ can be viewed as an operator mapping from
$\mathbb{R}^{d_2}$ to $\mathbb{R}^{d_1}$, and its rank can be alternatively expressed as the
smallest integer $k$ such that the matrix $M$ can be decomposed as $M = UV^T$ for some
$U \in \mathbb{R}^{d_1 \times k}$ and $V \in \mathbb{R}^{d_2 \times k}$. In view of the matrix
factorization $M = UV^T$, we would like $U$ and $V$ to have a small number of columns. The
number of columns of $U$ and $V$ can be relaxed in a different way from the usual trace-norm
by the so-called max-norm [28], which is defined by

$$\|M\|_{\max} = \min_{M = UV^T} \big\{ \|U\|_{2,\infty} \|V\|_{2,\infty} \big\}, \tag{1.1}$$

where the minimum is over all factorizations $M = UV^T$ with $\|U\|_{2,\infty}$ being the
operator norm of $U : \ell_2^k \to \ell_\infty^{d_1}$ and $\|V\|_{2,\infty}$ the operator norm
of $V : \ell_2^k \to \ell_\infty^{d_2}$ (or, equivalently, $V^T : \ell_1^{d_2} \to \ell_2^k$)
and $k = 1, \dots, d_1 \wedge d_2$. Note that $\|U\|_{2,\infty}$ is the maximum $\ell_2$ row
norm of $U$. Since $\ell_2$ is a Hilbert space, the factorization constant $\|\cdot\|_{\max}$
indeed defines a norm on the space of operators between $\ell_1^{d_2}$ and $\ell_\infty^{d_1}$.
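
As a small numerical illustration of definition (1.1), the sketch below computes an upper
bound on the max-norm from one particular factorization, the balanced SVD factorization
$U = P\sqrt{S}$, $V = Q\sqrt{S}$ with $M = PSQ^T$. This choice of factorization and the
function names are assumptions made for illustration; the exact max-norm requires
minimizing over all factorizations, so the value returned here is only an upper bound.

```python
import numpy as np

def row_norm_2inf(A):
    """Maximum Euclidean row norm, i.e. ||A||_{2,inf}."""
    return float(np.max(np.linalg.norm(A, axis=1)))

def max_norm_upper_bound(M):
    """Upper bound on ||M||_max from the balanced SVD factorization.

    Any factorization M = U V^T gives ||M||_max <= ||U||_{2,inf} ||V||_{2,inf};
    here U = P sqrt(S) and V = Q sqrt(S), where M = P S Q^T is the thin SVD.
    """
    P, s, Qt = np.linalg.svd(M, full_matrices=False)
    root = np.sqrt(s)
    U = P * root        # scale the columns of P
    V = Qt.T * root     # scale the columns of Q
    return row_norm_2inf(U) * row_norm_2inf(V)

# A rank-one sign matrix: the bound is tight and equals 1.
u = np.array([1.0, -1.0, 1.0])
v = np.array([1.0, 1.0, -1.0, -1.0])
bound = max_norm_upper_bound(np.outer(u, v))

# For a generic matrix the bound is at least the largest entry in absolute value.
M2 = np.random.default_rng(0).standard_normal((5, 4))
bound2 = max_norm_upper_bound(M2)
```

Since every entry satisfies $|M_{kl}| = |U_k^T V_l| \le \|U\|_{2,\infty}\|V\|_{2,\infty}$, the
bound can never fall below the largest entry of $M$ in absolute value.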

The max-norm was recently proposed as another convex surrogate to the rank of the
matrix. For collaborative filtering problems, the max-norm has been shown to be
empirically superior to the trace-norm [40]. Foygel and Srebro [14] used the max-norm for
matrix completion under the uniform sampling distribution. Their results are direct
consequences of a recent bound on the excess risk for a smooth loss function, such as the
quadratic loss, with a bounded second derivative [42].

Matrix completion has been well analyzed under the uniform sampling model, where
observed entries are assumed to be sampled randomly and uniformly. In such a setting, the
trace-norm regularized approach has been shown to have good theoretical and numerical
performance. However, in some applications such as collaborative filtering, the uniform
sampling model is unrealistic. For example, in the Netflix problem, the uniform sampling
model is equivalent to assuming all users are equally likely to rate each movie and all
movies are equally likely to be rated by any user. In practice, invariably some users are
more active than others and some movies are more popular and thus rated more frequently.
Hence, the sampling distribution is in fact non-uniform. In such a setting, Salakhutdinov
and Srebro [37] showed that the standard trace-norm relaxation can behave very poorly, and
suggested a weighted trace-norm regularizer, which incorporates the knowledge of the true
sampling distribution in its construction. Since the true sampling distribution is almost
always unknown and can only be estimated based on the locations of those entries that are
revealed in the sample, a commonly used method in practice is the empirically-weighted
trace-norm [13].

In this paper we study matrix completion based on noisy observations under a
general sampling model using the max-norm as a convex relaxation for the rank. The rate of
convergence for the max-norm constrained least squares estimator is obtained.
Information-theoretical methods are used to establish a matching minimax lower bound under
the general non-uniform sampling model. The minimax upper and lower bounds together yield
the optimal rate of convergence for the Frobenius norm loss. It is shown that the max-norm
regularized approach indeed provides a more stable approximate recovery guarantee, with
respect to the sampling distributions, than previously used trace-norm based approaches.
In the special case of the uniform sampling model, our results also show that the extra
logarithmic factors in the results given in [42] could be avoided after a careful analysis to
match the minimax lower bound with the upper bound (see Theorems 3.1 and 3.2 and the
discussions in Section 3).

The max-norm constrained minimization problem is a convex program. The computational
effectiveness of this method is also studied, based on a first-order algorithm developed
in [26] for solving convex programs involving a max-norm constraint, which outperforms the
semi-definite programming (SDP) method of Srebro, et al. [40]. We will show in Section 4
that the convex optimization problem can be implemented in polynomial time as a function
of the sample size and the matrix dimensions.

The remainder of the paper is organized as follows. After introducing basic notation
and definitions, Section 2 collects a few useful results on the max-norm, trace-norm and
Rademacher complexity that will be needed in the rest of the paper. Section 3 introduces
the model and the estimation procedure and then investigates the theoretical properties of
the estimator. Both minimax upper and lower bounds are given. The results show that
the max-norm constrained minimization method achieves the optimal rate of convergence
over the parameter space. Comparison with past work is also given. Computation and
implementation issues are discussed in Section 4. The proofs of the main results and key
technical lemmas are given in Section 5.



2 Notations and Preliminaries 

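For any positive integer $d$, we use $[d]$ to denote the collection of integers
$\{1, 2, \dots, d\}$. For a vector $u \in \mathbb{R}^d$ and $0 < p < \infty$, denote its
$\ell_p$-norm by $\|u\|_p = \big(\sum_{i=1}^d |u_i|^p\big)^{1/p}$. In particular,
$\|u\|_\infty = \max_{i=1,\dots,d} |u_i|$ is the $\ell_\infty$-norm. For a matrix
$M = (M_{kl}) \in \mathbb{R}^{d_1 \times d_2}$, let
$\|M\|_F = \sqrt{\sum_{k=1}^{d_1} \sum_{l=1}^{d_2} M_{kl}^2}$ be the Frobenius norm and let
$\|M\|_\infty = \max_{k,l} |M_{kl}|$ denote the elementwise $\ell_\infty$-norm. Given two
norms $\ell_p$ and $\ell_q$ on $\mathbb{R}^{d_2}$ and $\mathbb{R}^{d_1}$ respectively, the
corresponding operator norm $\|\cdot\|_{p,q}$ of a matrix $M \in \mathbb{R}^{d_1 \times d_2}$
is defined by $\|M\|_{p,q} = \sup_{\|x\|_p = 1} \|Mx\|_q$. It is easy to verify that
$\|M\|_{p,q} = \|M^T\|_{q^*,p^*}$, where $(p, p^*)$ and $(q, q^*)$ are conjugate pairs, i.e.
$\frac{1}{p} + \frac{1}{p^*} = 1$. In particular, $\|M\| = \|M\|_{2,2}$ is the spectral norm;
$\|M\|_{2,\infty} = \max_{k=1,\dots,d_1} \sqrt{\sum_{l=1}^{d_2} M_{kl}^2}$ is the maximum row
norm of $M$.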

We collect in this section some known results on the max-norm, trace-norm and Rademacher 
complexity that will be used repeatedly later. 

2.1 Max-norm and trace-norm 

For a matrix $M \in \mathbb{R}^{d_1 \times d_2}$, its trace-norm $\|M\|_*$ can also be
equivalently written as

$$\|M\|_* = \inf\Big\{ \sum_j |\sigma_j| : M = \sum_j \sigma_j u_j v_j^T,\ \|u_j\|_2 = \|v_j\|_2 = 1 \Big\}.$$

In other words, the trace-norm promotes low-rank decompositions with factors in $\ell_2$.
Similarly, using Grothendieck's inequality [18], the max-norm defined in (1.1) has the
following analogous representation in terms of factors in $\ell_\infty$:

$$\|M\|_{\max} \approx \inf\Big\{ \sum_j |\sigma_j| : M = \sum_j \sigma_j u_j v_j^T,\ \|u_j\|_\infty = \|v_j\|_\infty = 1 \Big\}.$$

The factor of equivalence is Grothendieck's constant $K_G \in (1.67, 1.79)$. Based on these
properties, Lee, et al. (2010) [26] expected max-norm regularization to be more effective
when dealing with uniformly bounded data.

Of the same flavor as the definition of the max-norm in (1.1), the trace-norm has the
following equivalent characterization in terms of the matrix factorization [41],

$$\|M\|_* = \min_{U,V : M = UV^T} \big\{ \|U\|_F \|V\|_F \big\} = \frac{1}{2} \min_{U,V : M = UV^T} \big( \|U\|_F^2 + \|V\|_F^2 \big).$$

It is easy to see that

$$\frac{\|M\|_*}{\sqrt{d_1 d_2}} \le \|M\|_{\max}, \tag{2.1}$$

which in turn implies that any low max-norm approximation is also a low trace-norm
approximation. As pointed out in [41], there can be a large gap between
$\frac{1}{\sqrt{d_1 d_2}} \|\cdot\|_*$ and $\|\cdot\|_{\max}$. The following relationship
between the trace-norm and Frobenius norm is well-known:

$$\|M\|_F \le \|M\|_* \le \sqrt{\mathrm{rank}(M)} \cdot \|M\|_F.$$

In the same spirit, an analogous bound holds for the max-norm, in connection with the
element-wise $\ell_\infty$-norm [28]:

$$\|M\|_\infty \le \|M\|_{\max} \le \sqrt{\mathrm{rank}(M)} \cdot \|M\|_{1,\infty} \le \sqrt{\mathrm{rank}(M)} \cdot \|M\|_\infty. \tag{2.2}$$

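For any $R > 0$, let

$$\mathbb{B}_{\max}(R) := \{ M \in \mathbb{R}^{d_1 \times d_2} : \|M\|_{\max} \le R \} \quad \text{and} \quad \mathbb{B}_*(R) := \{ M \in \mathbb{R}^{d_1 \times d_2} : \|M\|_* \le R \} \tag{2.3}$$

be the max-norm and trace-norm ball with radius $R$, respectively. It is now well-known [41]
that $\mathbb{B}_{\max}(1)$ can be bounded, from both below and above, by the convex hull of
rank-one sign matrices
$\mathcal{M}_\pm = \{ M \in \{\pm 1\}^{d_1 \times d_2} : \mathrm{rank}(M) = 1 \}$. That is,

$$\mathrm{conv}\,\mathcal{M}_\pm \subset \mathbb{B}_{\max}(1) \subset K_G \cdot \mathrm{conv}\,\mathcal{M}_\pm \tag{2.4}$$

with $K_G \in (1.67, 1.79)$ denoting Grothendieck's constant. Moreover, $\mathcal{M}_\pm$ is a
finite class with cardinality $|\mathcal{M}_\pm| = 2^{d-1}$, where $d = d_1 + d_2$.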
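
The cardinality claim can be checked by brute force for small dimensions. The sketch below
(an illustration only; the function name is an assumption) enumerates all products $uv^T$
of sign vectors and counts the distinct matrices. The pairs $(u, v)$ and $(-u, -v)$ give
the same matrix, which is where the exponent $d - 1$ comes from.

```python
import itertools
import numpy as np

def rank_one_sign_matrices(d1, d2):
    """Return the set of distinct matrices u v^T with u in {-1,1}^d1, v in {-1,1}^d2.

    Each matrix is stored by its raw bytes so that duplicates collapse in the set.
    """
    seen = set()
    for u in itertools.product((-1, 1), repeat=d1):
        for v in itertools.product((-1, 1), repeat=d2):
            seen.add(np.outer(u, v).tobytes())
    return seen

d1, d2 = 2, 3
count = len(rank_one_sign_matrices(d1, d2))  # compare with 2 ** (d1 + d2 - 1)
```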
2.2 Rademacher complexity

A technical tool used in our analysis involves data-dependent estimates of the Rademacher 
and Gaussian complexities of a function class. We refer to Bartlett and Mendelson [2] and 
references therein for a detailed introduction of these concepts. 

Definition 2.1. For a class $\mathcal{F}$ of functions mapping from $\mathcal{X}$ to
$\mathbb{R}$, its empirical Rademacher complexity over a specific sample
$S = (x_1, x_2, \dots) \subset \mathcal{X}$ is given by

$$\hat{R}_S(\mathcal{F}) = \frac{2}{|S|} E_\varepsilon \Big[ \sup_{f \in \mathcal{F}} \Big| \sum_{i=1}^{|S|} \varepsilon_i f(x_i) \Big| \Big], \tag{2.5}$$

where $\varepsilon = (\varepsilon_1, \varepsilon_2, \dots)$ is a Rademacher sequence. The
Rademacher complexity with respect to a distribution $\mathcal{P}$ is the expectation, over
an i.i.d. sample of $|S|$ points drawn from $\mathcal{P}$, denoted by

$$R_{|S|}(\mathcal{F}) = E_{S \sim \mathcal{P}}[\hat{R}_S(\mathcal{F})].$$

Replacing $\varepsilon_1, \dots, \varepsilon_n$ with independent Gaussian $N(0,1)$ random
variables $g_1, \dots, g_n$ leads to the definition of (empirical) Gaussian complexity.

Considering a matrix as a function from the index pairs to the entry values, Srebro and
Shraibman [41] obtained upper bounds on the Rademacher complexity of the unit balls
under both the trace-norm and the max-norm. Specifically, for any $d_1, d_2 > 2$ and any
sample of size $2 < |S| < d_1 d_2$, the empirical Rademacher complexity of the max-norm unit
ball is bounded by

$$\hat{R}_S(\mathbb{B}_{\max}(1)) \le 12 \sqrt{\frac{d}{|S|}}. \tag{2.6}$$
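
For intuition, the empirical Rademacher complexity of the finite class of rank-one sign
matrices can be estimated by brute force in very small dimensions. The sketch below is an
illustration under stated assumptions: it uses the normalization $2/|S|$ of Bartlett and
Mendelson [2], approximates the expectation over the Rademacher signs by Monte Carlo, and
enumerates all sign vectors, so it is only practical for tiny $d_1$ and $d_2$. Because a
linear functional attains its supremum over a convex hull at an extreme point, the same
value is the complexity of the convex hull of the sign matrices, which is contained in the
max-norm unit ball.

```python
import itertools
import numpy as np

def empirical_rademacher_sign(S, d1, d2, n_draws=2000, seed=0):
    """Monte Carlo estimate of (2/|S|) E sup_M |sum_i eps_i M[x_i]| over rank-one
    sign matrices M = u v^T, for a sample S given as a list of (row, col) pairs."""
    rng = np.random.default_rng(seed)
    rows = np.array([k for k, _ in S])
    cols = np.array([l for _, l in S])
    us = np.array(list(itertools.product((-1, 1), repeat=d1)))
    vs = np.array(list(itertools.product((-1, 1), repeat=d2)))
    total = 0.0
    for _ in range(n_draws):
        eps = rng.choice((-1, 1), size=len(S))
        # E[k, l] accumulates the signs attached to entry (k, l)
        E = np.zeros((d1, d2))
        np.add.at(E, (rows, cols), eps)
        # sum_i eps_i (u v^T)[x_i] = u^T E v; take the sup over all sign vectors
        total += np.max(np.abs(us @ E @ vs.T))
    return 2.0 * total / (n_draws * len(S))

S = [(0, 0), (1, 2), (2, 1), (0, 1)]
est = empirical_rademacher_sign(S, d1=3, d2=3)
```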

3 Max-Norm Constrained Minimization 
3.1 The model 

We now consider matrix completion under a general random sampling model. Let
$M_0 \in \mathbb{R}^{d_1 \times d_2}$ be an unknown matrix. Suppose that a random sample

$$S = \{ (i_1, j_1), (i_2, j_2), \dots, (i_n, j_n) \}$$

of the index set is drawn i.i.d. according to a general sampling distribution
$\Pi = \{\pi_{kl}\}$ on $[d_1] \times [d_2]$, with replacement, i.e.
$P[(i_t, j_t) = (k, l)] = \pi_{kl}$ for all $t$ and $(k, l)$. Given the random index subset
$S = \{(i_1, j_1), \dots, (i_n, j_n)\}$, we observe noisy entries
$\{Y_{i_t, j_t}\}_{t=1}^n$ indexed by $S$, i.e.

$$Y_{i_t, j_t} = (M_0)_{i_t, j_t} + \sigma \xi_t, \quad t = 1, \dots, n, \tag{3.1}$$

for some $\sigma > 0$. The noise variables $\xi_t$ are independent, with $E[\xi_t] = 0$ and
$E[\xi_t^2] = 1$.
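
The observation model (3.1) is straightforward to simulate. The following sketch draws
index pairs with replacement from a sampling distribution and adds Gaussian noise; the
function name, the Gaussian choice of noise, and the example distribution are assumptions
made for illustration.

```python
import numpy as np

def sample_observations(M0, Pi, n, sigma, seed=0):
    """Draw n index pairs i.i.d. from Pi (with replacement) and return the rows,
    columns and noisy observations Y_t = M0[i_t, j_t] + sigma * xi_t."""
    rng = np.random.default_rng(seed)
    d1, d2 = M0.shape
    flat = rng.choice(d1 * d2, size=n, replace=True, p=Pi.ravel())
    rows, cols = np.unravel_index(flat, (d1, d2))
    xi = rng.standard_normal(n)          # E[xi] = 0, E[xi^2] = 1
    return rows, cols, M0[rows, cols] + sigma * xi

# Example: a rank-2 matrix and a non-uniform sampling distribution.
d1, d2 = 4, 5
rng = np.random.default_rng(1)
M0 = rng.standard_normal((d1, 2)) @ rng.standard_normal((2, d2))
W = rng.uniform(0.5, 2.0, size=(d1, d2))
Pi = W / W.sum()
rows, cols, Y = sample_observations(M0, Pi, n=30, sigma=0.0)
```

With `sigma=0.0` the observations coincide with the sampled entries of `M0`, which gives a
quick sanity check of the indexing.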
Instead of assuming the uniform sampling distribution, we consider a general sampling
distribution $\Pi$ here. Since $\sum_{k=1}^{d_1} \sum_{l=1}^{d_2} \pi_{kl} = 1$, we have
$\max_{k,l} \pi_{kl} \ge \frac{1}{d_1 d_2}$. In addition, to ensure that each entry is
observed with a positive probability, it is natural to assume that there exists a positive
constant $\mu \ge 1$ such that

$$\pi_{kl} \ge \frac{1}{\mu d_1 d_2}, \quad \text{for all } (k, l) \in [d_1] \times [d_2]. \tag{3.2}$$

We write hereafter $d = d_1 + d_2$ for brevity. Clearly, $\max(d_1, d_2) \le d \le 2 \max(d_1, d_2)$.

Past work on matrix completion has mainly focused on the case of exact low-rank
matrices. Here we allow a relaxation of this assumption and consider the more general
setting of approximately low-rank matrices. Specifically, we consider recovery of matrices
with $\ell_\infty$-norm and max-norm constraints defined by

$$\mathcal{K}(\alpha, R) := \{ M \in \mathbb{R}^{d_1 \times d_2} : \|M\|_\infty \le \alpha,\ \|M\|_{\max} \le R \}. \tag{3.3}$$

Here both $\alpha$ and $R$ are free parameters to be determined. It is clear that
$R \ge \alpha$ is needed to guarantee that $\mathcal{K}(\alpha, R)$ is non-empty. If $M_0$ is
of rank at most $r$ and $\|M_0\|_\infty \le \alpha$, then by (2.2) we have
$M_0 \in \mathbb{B}_{\max}(\alpha \sqrt{r})$ and hence $M_0 \in \mathcal{K}(\alpha, \alpha \sqrt{r})$.
3.2 Max-norm constrained least squares estimator 

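Given a collection of observations $Y_S = \{Y_{i_t, j_t}\}_{t=1}^n$ from the observation
model (3.1), we estimate the unknown $M_0 \in \mathcal{K}(\alpha, R)$ for some
$R \ge \alpha > 0$ by the minimizer of the empirical risk with respect to the quadratic loss
function

$$\hat{\mathcal{L}}_n(M; Y) = \frac{1}{n} \sum_{t=1}^n (Y_{i_t, j_t} - M_{i_t, j_t})^2.$$

That is,

$$\hat{M}_{\max} := \arg\min_{M \in \mathcal{K}(\alpha, R)} \hat{\mathcal{L}}_n(M; Y). \tag{3.4}$$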
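
In code, the empirical risk is simply the mean squared residual over the sampled
positions. The helper below is a minimal sketch; its name and the representation of the
sample as separate row and column index arrays are assumptions made for illustration.

```python
import numpy as np

def empirical_risk(M, rows, cols, Y):
    """Average squared residual (1/n) * sum_t (Y_t - M[i_t, j_t])^2."""
    resid = Y - M[rows, cols]
    return float(np.mean(resid ** 2))

# Tiny worked example: M = 0, so the risk is the mean of Y_t^2 = (1 + 4 + 9) / 3.
rows = np.array([0, 1, 1])
cols = np.array([0, 0, 1])
Y = np.array([1.0, 2.0, 3.0])
risk = empirical_risk(np.zeros((2, 2)), rows, cols, Y)
```

Repeated index pairs are counted once per draw, in line with sampling with replacement.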

The minimization procedure requires that all the entries of $M_0$ are bounded in absolute
value by a known constant $\alpha$. This condition enforces that $M_0$ should not be too
"spiky", and too large a bound may jeopardize the accuracy of the estimation; see, e.g.,
[24, 32, 21]. On the other hand, as argued in Lee, et al. [26], max-norm regularization is
expected to be more effective particularly for uniformly bounded data, which is our main
motivation for using the max-norm constrained estimator.

Although the max-norm constrained minimization problem (3.4) is a convex program,
fast and efficient algorithms for solving large-scale optimization problems that incorporate
the max-norm have only been developed recently [26]. We will show in Section 4 that the
convex optimization problem (3.4) can indeed be implemented in polynomial time as a
function of the sample size $n$ and the matrix dimensions $d_1$ and $d_2$.
3.3 Upper bounds

We now state our main results concerning the recovery of an approximately low-rank matrix
$M_0$ using the max-norm constrained minimization method.

Theorem 3.1. Suppose that the noise sequence $\{\xi_t\}$ are i.i.d. standard normal random
variables, and the unknown matrix $M_0 \in \mathcal{K}(\alpha, R)$ for some
$R \ge \alpha > 0$. Then there exists an absolute constant $C$ such that for any
$t \in (0, 1)$ and a sample size $2 < n \le d_1 d_2$,

$$\sum_{k=1}^{d_1} \sum_{l=1}^{d_2} \pi_{kl} (\hat{M}_{\max} - M_0)_{kl}^2 \le C \Big\{ (\alpha \vee \sigma) R \sqrt{\frac{d}{n}} + [\,\cdots] \Big\} \tag{3.5}$$

holds with probability greater than $1 - e^{-d} - t$. If, in addition, the assumption (3.2)
is satisfied, then for a sample size $d < n \le d_1 d_2$,

$$\frac{1}{d_1 d_2} \|\hat{M}_{\max} - M_0\|_F^2 \le C \mu (\alpha \vee \sigma) R \sqrt{\frac{d}{n}} \tag{3.6}$$

holds with probability at least $1 - 2e^{-d}$.

Remark 3.1. The upper bounds given in Theorem 3.1 hold with high probability. The
rate of convergence under expectation can be obtained as a direct consequence. More
specifically, for a sample size $n$ with $d < n \le d_1 d_2$, we have

$$\sup_{M_0 \in \mathcal{K}(\alpha, R)} \frac{1}{d_1 d_2} E \|\hat{M}_{\max} - M_0\|_F^2 \le C \mu (\alpha \vee \sigma) R \sqrt{\frac{d}{n}}. \tag{3.7}$$

It can be seen from the proof of Theorem 3.1 that the normality assumption on the
noise can be relaxed to a class of sub-exponential random variables, those with at least an
exponential tail decay.

Corollary 3.1. Under the assumptions of Theorem 3.1, but assume instead that the noise
sequence $\{\xi_t\}$ are independent sub-exponential random variables, that is, there is a
constant $K > 0$ such that

$$\max_{t=1,\dots,n} E[\exp(|\xi_t| / K)] \le e. \tag{3.8}$$

Then, for a sample size $d < n \le d_1 d_2$,

$$\sum_{k=1}^{d_1} \sum_{l=1}^{d_2} \pi_{kl} (\hat{M}_{\max} - M_0)_{kl}^2 \le C (\alpha \vee \sigma K) R \sqrt{\frac{d}{n}} \tag{3.9}$$

with probability greater than $1 - 2e^{-d}$, where $C > 0$ is an absolute constant.
3.4 Information-theoretic lower bounds



Theorem 3.1 gives the rate of convergence for the max-norm constrained least squares
estimator $\hat{M}_{\max}$. In this section we shall use information-theoretical methods to
establish a minimax lower bound for non-uniform sampling at random matrix completion on the
max-norm ball. The minimax lower bound matches the rate of convergence given in (3.6)
when the sampling distribution $\Pi$ satisfies
$\frac{1}{\mu d_1 d_2} \le \min_{k,l} \pi_{kl} \le \max_{k,l} \pi_{kl} \le \frac{L}{d_1 d_2}$
for some constants $\mu$ and $L$. The results show that the max-norm constrained least
squares estimator is indeed rate-optimal in such a setting.

For the lower bound, we shall assume the sampling distribution $\Pi$ satisfies

$$\max_{k,l} \pi_{kl} \le \frac{L}{d_1 d_2} \tag{3.10}$$

for a positive constant $L \ge 1$. Clearly, when $L = 1$, this amounts to saying that the
sampling distribution is uniform.

Theorem 3.2. Suppose that the noise sequence $\{\xi_t\}$ are i.i.d. standard normal random
variables, the sampling distribution $\Pi$ obeys the condition (3.10) and the quintuple
$(n, d_1, d_2, \alpha, R)$ satisfies

$$\frac{48 \alpha^2}{d_1 \vee d_2} \le R^2 \le \frac{\sigma^2 (d_1 \wedge d_2) d_1 d_2}{128 L n}. \tag{3.11}$$

Then the minimax $\|\cdot\|_F$-risk is lower bounded as

$$\inf_{\hat{M}} \sup_{M \in \mathcal{K}(\alpha, R)} \frac{1}{d_1 d_2} E \|\hat{M} - M\|_F^2 \ge \min\Big\{ \frac{\alpha^2}{16},\ \frac{\sigma R}{256} \sqrt{\frac{d}{nL}} \Big\}. \tag{3.12}$$

In particular, for a sample size $n > (R/\alpha)^2 \frac{d}{L}$, we have

$$\inf_{\hat{M}} \sup_{M \in \mathcal{K}(\alpha, R)} \frac{1}{d_1 d_2} E \|\hat{M} - M\|_F^2 \ge \frac{(\alpha \wedge \sigma) R}{256} \sqrt{\frac{d}{nL}}. \tag{3.13}$$

Assume that both $\mu$ and $L$, which appear in (3.2) and (3.10) respectively, are bounded
above by universal constants. Then comparing the lower bound (3.13) with the upper bound
(3.7) shows that if the sample size $n > (R/\alpha)^2 d$, the optimal rate of convergence is
$(\alpha \wedge \sigma) R \sqrt{d/n}$, i.e.,

$$\inf_{\hat{M}} \sup_{M \in \mathcal{K}(\alpha, R)} \frac{1}{d_1 d_2} E \|\hat{M} - M\|_F^2 \asymp (\alpha \wedge \sigma) R \sqrt{\frac{d}{n}},$$

and the max-norm constrained minimization estimator (3.4) is rate-optimal. The requirement
here on the sample size, $n > (R/\alpha)^2 d$, is weak.

The proof of Theorem 3.2 follows the same outline as in [32], using information-theoretic
methods. A key technical tool for the proof is the following lemma, which guarantees the
existence of a suitably large packing set for $\mathcal{K}(\alpha, R)$ in the Frobenius norm.
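Lemma 3.1. Let $r = (R/\alpha)^2$ and let $\gamma \le 1$ be such that
$r/\gamma^2 \le d_1 \wedge d_2$ is an integer. There exists a subset
$\mathcal{M} \subset \mathcal{K}(\alpha, R)$ with cardinality

$$|\mathcal{M}| = \Big\lfloor \exp\Big( \frac{r (d_1 \vee d_2)}{16 \gamma^2} \Big) \Big\rfloor + 1$$

and with the following properties:

(i) For any $M \in \mathcal{M}$, $\mathrm{rank}(M) \le r/\gamma^2$ and
$M_{kl} \in \{\pm \gamma \alpha\}$, such that

$$\|M\|_\infty = \gamma \alpha \le \alpha, \qquad \frac{1}{d_1 d_2} \|M\|_F^2 = \gamma^2 \alpha^2.$$

(ii) For any two distinct $M^k, M^l \in \mathcal{M}$,

$$\frac{1}{d_1 d_2} \|M^k - M^l\|_F^2 > \frac{\gamma^2 \alpha^2}{2}.$$

The proof of Lemma 3.1 follows from Lemma 3 of [12] with a simple modification, which,
for self-containment, is given in Section 5.3.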

3.5 Comparison to past work 

It is now well-known that the exact recovery of a low-rank matrix in the noiseless case
requires the "incoherence conditions" on the target matrix $M_0$ [8, 9, 35, 17]. Instead, we
consider here the more general setting of approximately low-rank matrices, and prove that
approximate recovery is still possible without the subtle structural conditions.

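Our results are directly comparable to those of Negahban and Wainwright [32], in which
the trace-norm was used as a proxy to the rank. Specifically, Negahban and Wainwright
[32] considered the setting where the sampling distribution is a product distribution, i.e.

$$\pi_{kl} = \pi_{k\cdot}\, \pi_{\cdot l}, \quad \text{for all } (k, l) \in [d_1] \times [d_2],$$

where $\pi_{k\cdot}$ and $\pi_{\cdot l}$ are marginals satisfying

$$\pi_{k\cdot} \ge \frac{1}{\nu d_1}, \quad \pi_{\cdot l} \ge \frac{1}{\nu d_2} \quad \text{for some } \nu \ge 1. \tag{3.14}$$

Accordingly, define the weighted norms as

$$\|M\|_{w(t)} := \|\sqrt{W_r}\, M \sqrt{W_c}\|_t, \quad t \in \{F, *, \infty\},$$

where $W_r = d_1 \cdot \mathrm{diag}(\pi_{1\cdot}, \dots, \pi_{d_1 \cdot})$ and
$W_c = d_2 \cdot \mathrm{diag}(\pi_{\cdot 1}, \dots, \pi_{\cdot d_2})$. Assuming that the
unknown matrix $M_0$ satisfies

$$\|M_0\|_{w(*)} \le R \sqrt{d_1 d_2}, \quad \|M_0\|_{w(F)} \le \sqrt{d_1 d_2} \quad \text{and} \quad \frac{\|M_0\|_{w(\infty)}}{\|M_0\|_{w(F)}} \le \frac{\alpha}{\sqrt{d_1 d_2}},$$

then based on a collection of observations

$$Y_{i_t, j_t} = \varepsilon_t (M_0)_{i_t, j_t} + \sigma \xi_t, \quad t = 1, \dots, n,$$

where $(i_t, j_t)$ are i.i.d. according to $P[(i_t, j_t) = (k, l)] = \pi_{kl}$ and
$\varepsilon_t \in \{-1, +1\}$ are i.i.d. random signs, they proposed the following
estimator of $M_0$

$$\hat{M}_* \in \arg\min \Big\{ \frac{1}{n} \sum_{t=1}^n (Y_{i_t, j_t} - \varepsilon_t M_{i_t, j_t})^2 + \lambda_n \|M\|_{w(*)} \Big\} \tag{3.15}$$

and proved that for properly chosen $\lambda_n$ depending on $\sigma$, there exist absolute
constants $c_i$ such that

$$\frac{1}{d_1 d_2} \|\hat{M}_* - M_0\|_F^2 \le c_1 \Big\{ (\sigma \vee \nu) R \alpha\, [\,\cdots] \Big\} \tag{3.16}$$

holds with probability at least $1 - c_2 \exp(-c_3 \log d)$.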

First, the product distribution assumption is very restrictive and is not valid in many
applications. For example, in the case of the Netflix problem, this assumption would imply
that conditional on any movie, it will be rated by all users with the same probability.
Second, the constraint on $M_0$ depends heavily on the true sampling distribution, which is
unknown in practice and can only be estimated based on the empirical frequencies,
i.e. for any pair $(k, l) \in [d_1] \times [d_2]$,

$$\hat{\pi}_{k\cdot} = \frac{\sum_{t=1}^n 1\{i_t = k\}}{n}, \qquad \hat{\pi}_{\cdot l} = \frac{\sum_{t=1}^n 1\{j_t = l\}}{n}.$$

Since only a relatively small sample of the entries of $M_0$ is observed, these estimates are
unlikely to be accurate. The max-norm constrained minimization approach, on the other
hand, is proved (Theorem 3.1) to be effective in the presence of non-degenerate general
sampling distributions. The method does not require either a product distribution or
the knowledge of the exact true sampling distribution. Hence, the max-norm constrained
method indeed yields a more robust approximate recovery guarantee, with respect to the
sampling distributions.

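We now turn to the special case of uniform sampling. The "spikiness" assumption in
[32] can actually be reduced to a single constraint on the $\ell_\infty$-norm (see, e.g.
[21]). Let $\mathbb{B}_\infty(\alpha) = \{ M \in \mathbb{R}^{d_1 \times d_2} : \|M\|_\infty \le \alpha \}$
be the $\ell_\infty$-norm ball with radius $\alpha$. Define the class of matrices

$$\mathcal{K}_*(\alpha, R) := \Big\{ M \in \mathbb{B}_\infty(\alpha) : \frac{1}{\sqrt{d_1 d_2}} \|M\|_* \le R \Big\}. \tag{3.17}$$

It can be seen from (2.1) and (2.2) that
$\{ M \in \mathbb{B}_\infty(\alpha) : \mathrm{rank}(M) \le r \} \subset \mathcal{K}(\alpha, \alpha \sqrt{r}) \subset \mathcal{K}_*(\alpha, \alpha \sqrt{r})$.
The following results provide upper bounds on the accuracy of both the max- and trace-norm
regularized estimators under the Frobenius norm.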
Corollary 3.2. Suppose that the noise sequence $\{\xi_t\}$ are i.i.d. $N(0, 1)$ random
variables and the sampling distribution $\Pi$ is uniform on $[d_1] \times [d_2]$. Then the
following inequalities hold with probability at least $1 - 3/d$:

(i) The optimum $\hat{M}_{\max}$ to the convex program (3.4) satisfies

$$\sup_{M_0 \in \mathcal{K}(\alpha, R)} \frac{1}{d_1 d_2} \|\hat{M}_{\max} - M_0\|_F^2 \lesssim (\alpha \vee \sigma) R \sqrt{\frac{d}{n}} + [\,\cdots]. \tag{3.18}$$

(ii) The optimum $\hat{M}_*$ to the SDP (3.15) with all weighted norms replaced by the
standard ones and with a properly chosen $\lambda_n$ satisfies

$$\sup_{M_0 \in \mathcal{K}_*(\alpha, R)} \frac{1}{d_1 d_2} \|\hat{M}_* - M_0\|_F^2 \lesssim (\alpha \vee \sigma)\, [\,\cdots]. \tag{3.19}$$

The upper bound (3.18) follows immediately from (3.5) in Theorem 3.1, and (3.19)
is a direct extension of Theorem 7 in Klopp (2012), which considers the case of exact
low-rank matrices, i.e. $\mathrm{rank}(M_0) \le r$. The proof is essentially the same and
thus is omitted.

Foygel and Srebro [14] analyzed the estimation error of $\hat{M}_{\max}$ based on an excess
risk bound for empirical risk minimization with a smooth loss function recently developed in
[42]. Specifically, assuming sub-exponential noise and $M_0 \in \mathcal{K}(\alpha, R)$, it
was shown that with high probability,

$$\frac{1}{d_1 d_2} \|\hat{M}_{\max} - M_0\|_F^2 \lesssim (\alpha \vee \sigma) R\, [\,\cdots]. \tag{3.20}$$

After a more delicate analysis, our result shows that the additional factor in (3.20) that
is a power of $\log(n/d)$ is purely an artifact of the proof technique and thus can be
avoided. Moreover, in view of the lower bounds given in Theorem 3.2, we see that the
max-norm constrained least squares estimator $\hat{M}_{\max}$ achieves the optimal rate of
convergence for recovering approximately low-rank matrices over the parameter space
$\mathcal{K}(\alpha, R)$ under the Frobenius norm loss. To our knowledge, the best known rate
for the trace-norm regularized estimator (3.19) is near-optimal up to logarithmic factors in
a minimax sense, over a larger parameter space $\mathcal{K}_*(\alpha, R)$.



4 Computational Algorithm 

Although Theorem 3.1 presents theoretical guarantees that hold uniformly for any global
minimizer, it does not provide guidance on how to approximate such a global minimizer
using a polynomial-time algorithm. A parallel line of work has studied computationally
efficient algorithms for solving problems with the trace-norm constraint or penalization.
See, for instance, Mazumder, et al. [30], Nesterov [33] and Lin, et al. [27] among others.
Here we restrict our attention to the less-studied max-norm based approach. We recommend
using the fast first-order algorithm developed in Lee, et al. [26], which is particularly
tailored for large scale optimization problems that incorporate the max-norm. The problem
of interest to us is the optimization program (3.4) with both the max-norm and the
element-wise $\ell_\infty$-norm constraints, in which case the algorithm introduced in [26]
can be applied only after some slight modifications as described below.

Due to Srebro, et al. [40], the max-norm of a $d_1 \times d_2$ matrix $M$ can be computed
via a semi-definite program:

$$\|M\|_{\max} = \min R \quad \text{s.t.} \quad \begin{pmatrix} W_1 & M \\ M^T & W_2 \end{pmatrix} \succeq 0,\ \mathrm{diag}(W_1) \le R,\ \mathrm{diag}(W_2) \le R.$$

Correspondingly, we can reformulate (3.4) as the following SDP problem

$$\min f(M; Y) \quad \text{s.t.} \quad \begin{pmatrix} W_1 & M \\ M^T & W_2 \end{pmatrix} \succeq 0,\ \mathrm{diag}(W_1) \le R,\ \mathrm{diag}(W_2) \le R,\ \|M\|_\infty \le \alpha,$$

where the objective function $f$ is given by

$$f(M; Y) = \hat{\mathcal{L}}_n(M; Y).$$

This SDP can be solved using standard interior-point methods, though these are fairly slow
and do not scale to matrices with large dimensions. For large-scale problems, an alternative
factorization method based on (1.1), as described below, is preferred [26].

We begin by introducing dummy variables $U \in \mathbb{R}^{d_1 \times k}$,
$V \in \mathbb{R}^{d_2 \times k}$ for some $1 \le k \le d_1 + d_2$ and let $M = UV^T$. If the
optimal solution $\hat{M}_{\max}$ is known to have rank at most $r$, we can take
$U \in \mathbb{R}^{d_1 \times (r+1)}$, $V \in \mathbb{R}^{d_2 \times (r+1)}$. In practice,
without a known guarantee on the rank of $\hat{M}_{\max}$, we alternatively truncate the
number of columns $k$ to some reasonably high value less than $d_1 + d_2$. Then we rewrite
the original problem (3.4) in the factored form as follows:

$$\text{minimize } f(UV^T; Y) \quad \text{subject to } \max\{ \|U\|_{2,\infty}^2, \|V\|_{2,\infty}^2 \} \le R, \quad \max_{k,l} |U_k^T V_l| \le \alpha. \tag{4.1}$$

This problem is non-convex, since it involves a constraint on all product factorizations
$UV^T$. However, when the size of the problem (i.e. $k$) is large enough, Burer and Choi
(2006) proved that every local minimum of this reformulated problem is a global minimum. To
solve this problem fast and efficiently, Lee, et al. [26] suggest the following first-order
methods.
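4.1 Projected gradient method

Notice that $f(M; Y) = \hat{\mathcal{L}}_n(M; Y)$ is differentiable with respect to the first
argument. The method of projected gradient descent generates a sequence of iterates
$\{(U^t, V^t), t = 0, 1, 2, \dots\}$ by the recursion: First define an intermediate iterate

$$\begin{bmatrix} \tilde{U}^{t+1} \\ \tilde{V}^{t+1} \end{bmatrix} = \begin{bmatrix} U^t - \tau \nabla f(U^t (V^t)^T; Y)\, V^t \\ V^t - \tau \nabla f(U^t (V^t)^T; Y)^T U^t \end{bmatrix},$$

where $\tau > 0$ is a stepsize parameter. If
$\|\tilde{U}^{t+1} (\tilde{V}^{t+1})^T\|_\infty > \alpha$, we replace

$$\begin{bmatrix} \tilde{U}^{t+1} \\ \tilde{V}^{t+1} \end{bmatrix} \quad \text{with} \quad \frac{\sqrt{\alpha}}{\|\tilde{U}^{t+1} (\tilde{V}^{t+1})^T\|_\infty^{1/2}} \begin{bmatrix} \tilde{U}^{t+1} \\ \tilde{V}^{t+1} \end{bmatrix},$$

otherwise we keep it unchanged. Next, compute updates according to

$$\begin{bmatrix} U^{t+1} \\ V^{t+1} \end{bmatrix} = \Pi_R \left( \begin{bmatrix} \tilde{U}^{t+1} \\ \tilde{V}^{t+1} \end{bmatrix} \right),$$

where $\Pi_R$ denotes the Euclidean projection onto the set
$\{ (U, V) : \max(\|U\|_{2,\infty}^2, \|V\|_{2,\infty}^2) \le R \}$. This projection can be
computed by re-scaling the rows of the current iterate whose squared $\ell_2$-norms exceed
$R$ so that their squared norms become exactly $R$, while rows with squared norms already
at most $R$ remain unchanged.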
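
One iteration of this scheme can be sketched as follows. This is an illustration under
stated assumptions rather than the implementation of [26]: it uses a constant step size
`tau`, the quadratic loss, and the convention that the bound $R$ applies to squared row
norms (so rows are rescaled to Euclidean norm $\sqrt{R}$); all function names are
hypothetical.

```python
import numpy as np

def grad_f(M, rows, cols, Y):
    """Gradient of the average squared loss with respect to the matrix M."""
    G = np.zeros_like(M)
    # repeated index pairs accumulate their contributions
    np.add.at(G, (rows, cols), 2.0 * (M[rows, cols] - Y) / len(Y))
    return G

def project_rows(A, R):
    """Rescale rows whose squared l2-norm exceeds R so that it equals R."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    scale = np.minimum(1.0, np.sqrt(R) / np.maximum(norms, 1e-12))
    return A * scale

def projected_gradient_step(U, V, rows, cols, Y, tau, alpha, R):
    """Gradient step, rescaling for the entrywise bound, then row projection."""
    G = grad_f(U @ V.T, rows, cols, Y)
    U_new = U - tau * (G @ V)
    V_new = V - tau * (G.T @ U)
    sup = np.max(np.abs(U_new @ V_new.T))
    if sup > alpha:
        c = np.sqrt(alpha / sup)      # both factors shrink by the same amount
        U_new, V_new = c * U_new, c * V_new
    return project_rows(U_new, R), project_rows(V_new, R)

rng = np.random.default_rng(0)
U = rng.standard_normal((5, 3))
V = rng.standard_normal((4, 3))
rows = np.array([0, 1, 2, 4, 4])
cols = np.array([0, 3, 1, 2, 2])
Y = np.array([1.0, -1.0, 0.5, 0.2, 0.3])
U1, V1 = projected_gradient_step(U, V, rows, cols, Y, tau=0.1, alpha=1.0, R=2.0)
```

Because the projection only shrinks rows, it cannot increase any entry of $UV^T$ in
absolute value, so both constraints of the factored problem hold after the step.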



4.2 Stepwise gradient 

For the matrix completion problem, we allow the objective function to act on matrices via
the average loss function over their entries:

$$f(M; Y) = f(UV^T; Y) = \frac{1}{n} \sum_{t=1}^n g(U_{i_t}^T V_{j_t}; Y_{i_t, j_t}),$$

where $S = \{(i_1, j_1), \dots, (i_n, j_n)\} \subset [d_1] \times [d_2]$ is a training set of
row-column indices, and $U_i$ and $V_j$ denote the $i$th row of $U$ and the $j$th row of $V$,
respectively. We are currently interested in the case where $g(t; y) = (t - y)^2$.

In view of the above decomposition for $f$, it is thus natural to use a stepwise gradient
method: enumerate all elements of $S$ in an arbitrary order with repeated ones only counted
once; for the pair $(i_t, j_t)$ at the $t$-th iteration, take a step in the direction
opposite to the gradient of $g(U_{i_t}^T V_{j_t}; Y_{i_t, j_t})$, then apply the rescaling
and the projection described in the last subsection. More precisely, if
$|U_{i_t}^T V_{j_t}| > \alpha$, $U_{i_t}$ and $V_{j_t}$ are replaced with

$$\frac{\sqrt{\alpha}\, U_{i_t}}{|U_{i_t}^T V_{j_t}|^{1/2}} \quad \text{and} \quad \frac{\sqrt{\alpha}\, V_{j_t}}{|U_{i_t}^T V_{j_t}|^{1/2}}$$

respectively, otherwise we do not make any change; next, if $\|U_{i_t}\|_2^2 > R$, we
project it back so that $\|U_{i_t}\|_2^2 = R$, otherwise we do not make any change (the same
procedure for $V_{j_t}$). In the $t$-th iteration, we do not need to consider any other rows
of $U$ and $V$. As demonstrated in [26], this stepwise algorithm could be computationally as
efficient as optimization with the trace-norm.
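
A single pass of the stepwise method can be sketched as below. As with the previous
subsection, this is a hedged illustration, not the code of [26]: the step size `tau` is
constant, the loss is $g(t; y) = (t - y)^2$, the bound $R$ is applied to squared row
norms, and the caller is assumed to have removed repeated index pairs beforehand.

```python
import numpy as np

def stepwise_pass(U, V, rows, cols, Y, tau, alpha, R):
    """One pass over the observed entries; each step touches one row of U and
    one row of V. U and V are modified in place and also returned."""
    for i, j, y in zip(rows, cols, Y):
        ui, vj = U[i].copy(), V[j].copy()
        resid = ui @ vj - y                     # derivative of g is 2 * resid
        ui_new = ui - tau * 2.0 * resid * vj
        vj_new = vj - tau * 2.0 * resid * ui
        prod = abs(ui_new @ vj_new)
        if prod > alpha:                        # rescale so |u_i^T v_j| = alpha
            c = np.sqrt(alpha / prod)
            ui_new, vj_new = c * ui_new, c * vj_new
        for w in (ui_new, vj_new):              # project onto squared norm <= R
            sq = w @ w
            if sq > R:
                w *= np.sqrt(R / sq)
        U[i], V[j] = ui_new, vj_new
    return U, V

rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((5, 3))
V = 0.1 * rng.standard_normal((4, 3))
rows = np.array([0, 1, 2, 4, 3])
cols = np.array([0, 3, 1, 2, 2])
Y = np.array([1.0, -1.0, 0.5, 0.2, 0.3])
stepwise_pass(U, V, rows, cols, Y, tau=0.05, alpha=1.0, R=2.0)
```

Only the pair of rows visited in a step is guaranteed to satisfy the entrywise bound
right after that step, which mirrors the remark that other rows need not be considered.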

4.3 Implementation 

Before the max-norm constrained approach can be implemented in practice to generate a full
matrix by filling in missing entries, additional prior knowledge of the unknown true matrix
is needed to avoid distorted results. As before, let $M_0 \in \mathbb{R}^{d_1 \times d_2}$
be the true underlying matrix. Good upper bounds for the following key quantities are
needed in advance:

$$\alpha_0 = \|M_0\|_\infty, \quad R_0 = \|M_0\|_{\max} \quad \text{and} \quad r_0 = \mathrm{rank}(M_0). \tag{4.2}$$

Rather than estimating $R_0$ directly from a matrix with missing data, note from (2.2) that
$\alpha_0 \sqrt{r_0}$ is a sharp upper bound on $R_0$ and is more amenable to estimation.
Fortunately, it is possible to convincingly specify $\alpha_0$ beforehand in many real-life
applications. When dealing with the Netflix data, for instance, $\alpha_0$ can be chosen as
the highest rating index; in the structure-from-motion problem, $\alpha_0$ depends on the
range of the camera field of view, which in most cases is sufficiently large to capture the
feature point trajectories. In cases where the percentage of missing entries is low, the
largest magnitude of the observed entries can be used as an alternative for $\alpha_0$.

As for $r_0$, we recommend the rank estimation approach recently developed in [19], which
was shown to be effective in computer vision problems. Recall that in the
structure-from-motion problem, each column of the data matrix corresponds to a trajectory
along the frames of a given feature point, and can be regarded as a signal vector with
missing coordinates. Due to the rigidity of the moving objects, it was noted in [19] that
the behavior of observed and missing data is the same and thus they both generate an
analogous (frequency) spectral representation. Motivated by this observation, the proposed
approach is based on the study of changes in frequency spectra on the initial matrix after
missing entries are recovered.

Next we describe an implementation of the max-norm constrained matrix completion
procedure, which incorporates the rank estimation approach in [19]. Assume without loss
of generality that $\alpha_0$ is known.

(1) Given the observed partial matrix $M_S$, the initial matrix $M_{ini}$ is obtained by
filling each missing entry of $M_S$ with the average of the corresponding column. Apply the
Fast Fourier Transform (FFT) to the columns of $M_{ini}$ and take its modulus, i.e.
$F := |FFT(M_{ini})|$.

(2) Set an initial rank $r = 2$ and an upper bound $r_{\max}$. Clearly, $r_{\max} \le \min\{d_1, d_2\}$,
and it can be computed automatically by adding a criterion for stopping the iteration.

(3) For the current value of $r$, use the computational algorithms given in Section 4
with $R = \alpha_0\sqrt{r}$ to solve the max-norm constrained optimization (3.4). The resulting
estimated full matrix is denoted by $M_r$.

(4) Apply the FFT to $M_r$ as in step 1. Write $F_r = |FFT(M_r)|$ and compute the error
$e(r) = \|F - F_r\|_F$.

(5) If $r < r_{\max}$, set $r = r + 1$ and go to step 3.

Finally, let

$$r^* = \arg\min_r e(r),$$

and the corresponding $M_{r^*}$ is the final estimate of $M_0$. Clearly, the above procedure can
be modified by replacing the rank $r$ with the max-norm $R$. A suitable initial value for the
max-norm is $R = \alpha_0\sqrt{2}$, and at each iteration $R$ is increased to $R + \delta$ with a
fixed step size $\delta > 0$. An upper bound $R_{\max}$ could be computed automatically by adding
a criterion for stopping the iteration.
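The rank-selection loop in steps (1)-(5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `solver` is a hypothetical callable standing in for the max-norm constrained algorithm for (3.4), `svd_stand_in_solver` is a crude truncated-SVD placeholder included only so that the sketch runs, and all function and parameter names are invented for this example.

```python
import numpy as np


def column_mean_fill(M_obs, mask):
    """Step (1): fill each missing entry with the mean of the observed entries in its column."""
    observed = np.where(mask, M_obs, 0.0)
    counts = np.maximum(mask.sum(axis=0), 1)  # guard against columns with no observations
    col_means = observed.sum(axis=0) / counts
    return np.where(mask, M_obs, col_means[np.newaxis, :])


def fft_modulus(M):
    """Modulus of the FFT applied to every column (axis 0 runs along the frames)."""
    return np.abs(np.fft.fft(M, axis=0))


def svd_stand_in_solver(M_obs, mask, alpha0, R):
    """Placeholder only (an assumption of this sketch): rank-r truncated SVD of the
    mean-filled matrix with r = (R / alpha0)^2, clipped to [-alpha0, alpha0].
    A real run would call a max-norm constrained solver for (3.4) here."""
    r = max(1, int(round((R / alpha0) ** 2)))
    U, s, Vt = np.linalg.svd(column_mean_fill(M_obs, mask), full_matrices=False)
    M_r = (U[:, :r] * s[:r]) @ Vt[:r, :]
    return np.clip(M_r, -alpha0, alpha0)


def select_rank(M_obs, mask, alpha0, solver, r_max, r_init=2):
    """Steps (2)-(5): scan r = r_init, ..., r_max and keep the minimizer of e(r)."""
    F = fft_modulus(column_mean_fill(M_obs, mask))
    errors, estimates = {}, {}
    for r in range(r_init, r_max + 1):
        R = alpha0 * np.sqrt(r)  # step (3): R = alpha0 * sqrt(r)
        estimates[r] = solver(M_obs, mask, alpha0, R)
        errors[r] = np.linalg.norm(F - fft_modulus(estimates[r]), "fro")  # step (4)
    r_star = min(errors, key=errors.get)
    return r_star, estimates[r_star], errors
```

Here `mask` is a boolean array marking observed entries, and `select_rank(M_obs, mask, alpha0, svd_stand_in_solver, r_max)` returns the selected rank, the corresponding estimate, and the error profile $e(r)$. The variant that scans $R$ directly would replace the integer loop by increments of a fixed step size.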

5 Proofs 

We prove the main results, Theorems 3.1 and 3.2, in this section. The proofs of a few key
technical lemmas, including Lemma 3.1, are also given.

5.1 Proof of Theorem 3.1 

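For simplicity, we write $\hat M = \hat M_{\max}$ as long as there is no ambiguity. To begin with,
noting that $\hat M$ is optimal and $M_0$ is feasible for the convex optimization problem (3.4),
we have the basic inequality

$$\frac{1}{n}\sum_{t=1}^n \big(Y_{i_t,j_t} - \hat M_{i_t,j_t}\big)^2 \le \frac{1}{n}\sum_{t=1}^n \big(Y_{i_t,j_t} - (M_0)_{i_t,j_t}\big)^2.$$

This, combined with our model assumption that $Y_{i_t,j_t} = (M_0)_{i_t,j_t} + \sigma\xi_t$, yields

$$\frac{1}{n}\sum_{t=1}^n \hat\Delta_{i_t,j_t}^2 = \frac{1}{n}\sum_{t=1}^n \big\{\hat M_{i_t,j_t} - (M_0)_{i_t,j_t}\big\}^2 \le \frac{2\sigma}{n}\sum_{t=1}^n \xi_t \hat\Delta_{i_t,j_t}, \tag{5.1}$$

where $\hat\Delta = \hat M - M_0 \in \mathcal{K}(2\alpha, 2R)$ is the error matrix. The major challenges
in proving Theorem 3.1 thus consist of two parts: bounding the left-hand side of (5.1) from below
in a uniform sense, and bounding the right-hand side of (5.1) from above.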
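As a worked step, the passage from the basic inequality to (5.1) is the usual expansion of a square. The display below is a sketch which assumes that the basic inequality compares the empirical squared losses of $\hat M$ and $M_0$, and which writes $\hat\Delta = \hat M - M_0$ and $Y_{i_t,j_t} = (M_0)_{i_t,j_t} + \sigma\xi_t$:

```latex
\begin{aligned}
\frac{1}{n}\sum_{t=1}^n \big(Y_{i_t,j_t} - \hat M_{i_t,j_t}\big)^2
  &= \frac{1}{n}\sum_{t=1}^n \big(\sigma\xi_t - \hat\Delta_{i_t,j_t}\big)^2
   = \frac{\sigma^2}{n}\sum_{t=1}^n \xi_t^2
     - \frac{2\sigma}{n}\sum_{t=1}^n \xi_t\hat\Delta_{i_t,j_t}
     + \frac{1}{n}\sum_{t=1}^n \hat\Delta_{i_t,j_t}^2, \\
\frac{1}{n}\sum_{t=1}^n \big(Y_{i_t,j_t} - (M_0)_{i_t,j_t}\big)^2
  &= \frac{\sigma^2}{n}\sum_{t=1}^n \xi_t^2 .
\end{aligned}
```

Since the first quantity is at most the second, the common term $\sigma^2 n^{-1}\sum_t \xi_t^2$ cancels and what remains is $n^{-1}\sum_t \hat\Delta_{i_t,j_t}^2 \le 2\sigma n^{-1}\sum_t \xi_t\hat\Delta_{i_t,j_t}$.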
Step 1. (Upper bound). Recall that $\{\xi_t\}_{t=1}^n$ is a sequence of $N(0,1)$ random variables
and $S = \{(i_1,j_1), \ldots, (i_n,j_n)\}$ is drawn i.i.d. according to $\Pi$ on $[d_1] \times [d_2]$.
Define

$$\widehat{\mathcal{R}}_n(\alpha, R) := \sup_{M \in \mathcal{K}(\alpha,R)} \Big| \frac{1}{n}\sum_{t=1}^n \xi_t M_{i_t,j_t} \Big|. \tag{5.2}$$



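Due to Maurey and Pisier [34], we obtain that for any realization of the random training
set $S$ and for any $\delta > 0$, with probability at least $1 - \delta$ over $\xi = \{\xi_t\}$,

$$\begin{aligned}
\sup_{M \in \mathcal{K}(\alpha,R)} \Big| \frac{1}{n}\sum_{t=1}^n \xi_t M_{i_t,j_t} \Big|
&\le \mathbb{E}_\xi \Big[ \sup_{M \in \mathcal{K}(\alpha,R)} \Big| \frac{1}{n}\sum_{t=1}^n \xi_t M_{i_t,j_t} \Big| \Big]
   + \pi \sqrt{\frac{\log(1/\delta)\, \max_{M \in \mathcal{K}(\alpha,R)} \sum_{t=1}^n M_{i_t,j_t}^2}{2n^2}} \\
&\le \mathbb{E}_\xi \Big[ \sup_{M \in \mathcal{K}(\alpha,R)} \Big| \frac{1}{n}\sum_{t=1}^n \xi_t M_{i_t,j_t} \Big| \Big]
   + \pi\alpha \sqrt{\frac{\log(1/\delta)}{2n}}.
\end{aligned} \tag{5.3}$$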



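Thus it remains to estimate the following expectation over the class of matrices $\mathcal{K}(\alpha, R)$:

$$\mathcal{R}_n := \mathbb{E}_\xi \Big[ \sup_{M \in \mathcal{K}(\alpha,R)} \Big| \frac{1}{n}\sum_{t=1}^n \xi_t M_{i_t,j_t} \Big| \Big].$$

As a direct consequence of (2.4), we have

$$\mathcal{R}_n \le K_G \cdot R \cdot \mathbb{E}_\xi \Big[ \sup_{M \in \mathcal{M}_\pm} \Big| \frac{1}{n}\sum_{t=1}^n \xi_t M_{i_t,j_t} \Big| \Big], \tag{5.4}$$

where $\mathcal{M}_\pm$ contains rank-one sign matrices with cardinality $|\mathcal{M}_\pm| = 2^{d-1}$.
For each $M \in \mathcal{M}_\pm$, $\sum_{t=1}^n \xi_t M_{i_t,j_t}$ is Gaussian with mean zero and
variance $n$. Then the expectation of this Gaussian maximum can be bounded by

$$\frac{1}{n}\sqrt{2n \log(|\mathcal{M}_\pm|)} \le \sqrt{2\log 2}\, \sqrt{\frac{d}{n}}.$$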

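Since the upper bound is uniform with respect to all realizations of $S$, we conclude that
with probability at least $1 - \delta$ over both the random sample $S$ and the noise $\{\xi_t\}_{t=1}^n$,

$$\widehat{\mathcal{R}}_n(\alpha, R) \le 4\Big( R\sqrt{\frac{d}{n}} + \alpha\sqrt{\frac{\log(1/\delta)}{n}} \Big). \tag{5.5}$$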
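The Gaussian-maximum bound used above can be sanity-checked numerically on a tiny instance. The sketch below is an illustration only: it enumerates all rank-one sign matrices $uv^T$ for small $d_1, d_2$ (fixing $u_1 = +1$ to avoid counting $uv^T = (-u)(-v)^T$ twice), draws one index set $S$ uniformly at random, and estimates the expected supremum by Monte Carlo. The uniform sampling, the function names and the chosen sizes are assumptions of this example.

```python
import itertools
import math

import numpy as np


def rank_one_sign_matrices(d1, d2):
    """All matrices u v^T with u in {+-1}^d1, v in {+-1}^d2; fixing u[0] = +1 removes duplicates."""
    mats = []
    for u_tail in itertools.product([1, -1], repeat=d1 - 1):
        u = np.array((1,) + u_tail)
        for v in itertools.product([1, -1], repeat=d2):
            mats.append(np.outer(u, np.array(v)))
    return np.stack(mats)


def gaussian_maximum_bound(d1, d2, n):
    """Right-hand side sqrt(2 log 2) * sqrt(d / n) with d = d1 + d2."""
    return math.sqrt(2.0 * math.log(2.0)) * math.sqrt((d1 + d2) / n)


def gaussian_maximum_estimate(d1, d2, n, trials, seed=0):
    """Monte Carlo estimate of E sup_M |(1/n) sum_t xi_t M_{i_t, j_t}| for one fixed S."""
    rng = np.random.default_rng(seed)
    mats = rank_one_sign_matrices(d1, d2)
    rows = rng.integers(0, d1, size=n)   # index set S, drawn uniformly here (an assumption)
    cols = rng.integers(0, d2, size=n)
    sampled = mats[:, rows, cols]        # shape (|M_pm|, n): entries of each M along S
    xi = rng.standard_normal((trials, n))
    sups = np.abs(xi @ sampled.T / n).max(axis=1)
    return float(sups.mean()), len(mats)
```

For example, `gaussian_maximum_estimate(3, 3, 20, 2000)` returns the estimated expectation together with the cardinality of the enumerated class, which can be compared with `gaussian_maximum_bound(3, 3, 20)` and with $2^{d-1}$ respectively.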



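On the other hand, in the case of sub-exponential noise, i.e. $\{\xi_t\}$ satisfies the assumption
(3.8), it follows from (2.4) that

$$\widehat{\mathcal{R}}_n(\alpha, R) \le K_G \cdot R \cdot \sup_{M \in \mathcal{M}_\pm} \Big| \frac{1}{n}\sum_{t=1}^n \xi_t M_{i_t,j_t} \Big| \quad \text{with } |\mathcal{M}_\pm| = 2^{d-1}.$$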
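For any realization of the random training set $S = \{(i_1,j_1), \ldots, (i_n,j_n)\}$ and for any
fixed $M \in \mathcal{M}_\pm$, a Bernstein-type inequality for sub-exponential random variables [45]
yields, for every $u > 0$,

$$\mathbb{P}\Big\{ \Big| \frac{1}{n}\sum_{t=1}^n \xi_t M_{i_t,j_t} \Big| \ge u \Big\} \le 2\exp\Big\{ -c \cdot \min\Big( \frac{nu^2}{K^2}, \frac{nu}{K} \Big) \Big\},$$

where $c > 0$ is an absolute constant. By the union bound, we obtain that for a sample size
$n \ge d$,

$$\widehat{\mathcal{R}}_n(\alpha, R) \le CKR\sqrt{\frac{d}{n}} \tag{5.6}$$

holds with probability at least $1 - e^{-d}$.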
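The union-bound step can be made explicit as follows. This is a sketch which assumes that the Bernstein-type bound takes the standard form $2\exp\{-c\min(nu^2/K^2, nu/K)\}$ for a fixed sign matrix, where $K$ is the sub-exponential parameter in (3.8); the constant $C'$ is introduced here only for illustration.

```latex
\begin{aligned}
&\text{Take } u = C' K \sqrt{d/n} \text{ with } C' \ge 1. \text{ For } n \ge d: \quad
  \frac{n u^2}{K^2} = C'^2 d, \qquad \frac{n u}{K} = C' \sqrt{nd} \ge C' d,
  \quad\text{so}\quad \min\Big(\frac{nu^2}{K^2}, \frac{nu}{K}\Big) \ge C' d. \\
&\mathbb{P}\Big\{ \sup_{M \in \mathcal{M}_\pm} \Big| \frac{1}{n}\sum_{t=1}^n \xi_t M_{i_t,j_t} \Big| \ge u \Big\}
  \le |\mathcal{M}_\pm| \cdot 2 e^{-c C' d} = 2^{d} e^{-c C' d} = \exp\{ (\log 2 - c C') d \} \le e^{-d}
  \quad \text{if } C' \ge \frac{1 + \log 2}{c}. \\
&\text{On the complementary event: } \quad
  \widehat{\mathcal{R}}_n(\alpha, R) \le K_G \cdot R \cdot u = (K_G C')\, K R \sqrt{d/n}.
\end{aligned}
```

Under these assumptions one may therefore take $C = K_G \max\{1, (1+\log 2)/c\}$ in a bound of the form (5.6).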

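Step 2. (Lower bound). For the given sampling distribution $\Pi$, define

$$\|M\|_\Pi^2 = \frac{1}{n}\mathbb{E}_{S \sim \Pi}\big[ \|M_S\|_2^2 \big] = \mathbb{E}_{(i_t,j_t) \sim \Pi}\big[ M_{i_t,j_t}^2 \big] = \sum_{k,l} \pi_{kl} M_{kl}^2,$$

where $M_S = (M_{i_1,j_1}, \ldots, M_{i_n,j_n})^T \in \mathbb{R}^n$ for a given training set $S$. Then, for any
$\beta \ge 1$, $\delta > 0$, consider the following subset

$$\mathcal{C}(\beta, \delta) := \{ M \in \mathcal{K}(1, \beta) : \|M\|_\Pi^2 \ge \delta \}.$$

Here $\delta$ can be regarded as a tolerance parameter. The goal is to show that there exists
some function $f_\beta$ such that with high probability, the following inequality

$$\frac{1}{n}\|M_S\|_2^2 \ge \frac{1}{2}\|M\|_\Pi^2 - f_\beta(n, d_1, d_2) \tag{5.7}$$

holds for all $M \in \mathcal{C}(\beta, \delta)$.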



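Proof of (5.7). Instead, we will prove a stronger result that

$$\Big| \frac{1}{n}\|M_S\|_2^2 - \|M\|_\Pi^2 \Big| \le \frac{1}{2}\|M\|_\Pi^2 + f_\beta(n, d_1, d_2)$$

for all $M \in \mathcal{C}(\beta, \delta)$, with high probability. Following the peeling argument as in [32], for
$\ell = 1, 2, \ldots$ and $\alpha = 3/2$, define a sequence of subsets

$$\mathcal{C}_\ell(\beta, \delta) := \{ M \in \mathcal{C}(\beta, \delta) : \alpha^{\ell-1}\delta \le \|M\|_\Pi^2 \le \alpha^\ell \delta \}$$

and for any radius $D > 0$, set

$$\mathcal{B}(D) := \{ M \in \mathcal{C}(\beta, \delta) : \|M\|_\Pi^2 \le D \}. \tag{5.8}$$

Therefore, if there exists some $M \in \mathcal{C}(\beta, \delta)$ satisfying

$$\Big| \frac{1}{n}\|M_S\|_2^2 - \|M\|_\Pi^2 \Big| > \frac{1}{2}\|M\|_\Pi^2 + f_\beta(n, d_1, d_2),$$

then there corresponds an $\ell \ge 1$ such that $M \in \mathcal{C}_\ell(\beta, \delta) \subset \mathcal{B}(\alpha^\ell \delta)$ and,
since $\frac{1}{2}\|M\|_\Pi^2 \ge \frac{1}{2}\alpha^{\ell-1}\delta = \frac{1}{3}\alpha^\ell\delta$,

$$\Big| \frac{1}{n}\|M_S\|_2^2 - \|M\|_\Pi^2 \Big| > \frac{1}{3}\alpha^\ell \delta + f_\beta(n, d_1, d_2).$$

So the main task is to show that the latter event occurs with small probability. To this end,
define the maximum deviation

$$\Delta_D(S) := \sup_{M \in \mathcal{B}(D)} \Big| n^{-1}\|M_S\|_2^2 - \|M\|_\Pi^2 \Big|. \tag{5.9}$$

The following lemma shows that $n^{-1}\|M_S\|_2^2$ does not deviate far from its expectation
uniformly for all $M \in \mathcal{B}(D)$.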

Lemma 5.1 (Concentration). There exists a universal positive constant $C_1$ such that for
any radius $D > 0$,

$$\mathbb{P}_{S \sim \Pi}\Big\{ \Delta_D(S) > \frac{D}{3} + C_1 \beta \sqrt{\frac{d}{n}} \Big\} \le e^{-nD/10}. \tag{5.10}$$



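In view of the above lemma, we can set $f_\beta(n, d_1, d_2) = C_1 \beta \sqrt{d/n}$ and consider the
following sequence of events

$$\mathcal{E}_\ell = \Big\{ \Delta_{\alpha^\ell \delta}(S) > \frac{1}{3}\alpha^\ell \delta + f_\beta(n, d_1, d_2) \Big\}, \quad \ell = 1, 2, \ldots.$$

Since $\mathcal{C}(\beta, \delta) = \bigcup_{\ell \ge 1} \mathcal{C}_\ell(\beta, \delta)$, using the union bound we have

$$\begin{aligned}
&\mathbb{P}\Big\{ \exists M \in \mathcal{C}(\beta, \delta) \text{ s.t. } \Big| \frac{1}{n}\|M_S\|_2^2 - \|M\|_\Pi^2 \Big| > \frac{1}{2}\|M\|_\Pi^2 + f_\beta(n, d_1, d_2) \Big\} \\
&\quad \le \sum_{\ell=1}^\infty \mathbb{P}\Big\{ \exists M \in \mathcal{C}_\ell(\beta, \delta) \text{ s.t. } \Big| \frac{1}{n}\|M_S\|_2^2 - \|M\|_\Pi^2 \Big| > \frac{1}{2}\|M\|_\Pi^2 + f_\beta(n, d_1, d_2) \Big\} \\
&\quad \le \sum_{\ell=1}^\infty \mathbb{P}(\mathcal{E}_\ell) \le \sum_{\ell=1}^\infty \exp(-n\alpha^\ell \delta / 10) \\
&\quad \le \sum_{\ell=1}^\infty \exp\{ -\ell \log(\alpha) \cdot n\delta / 10 \} \le \frac{\exp(-c_0 n\delta)}{1 - \exp(-c_0 n\delta)}
\end{aligned}$$

with $c_0 = \log(3/2)/10$, where we used the elementary inequality that

$$\alpha^\ell = \exp\{ \ell \log(\alpha) \} \ge \ell \log(\alpha).$$

Consequently, for a sample size $n \le d_1 d_2$ satisfying $\exp(-c_0 n\delta) \le \frac{1}{2}$, or equivalently,
$n \ge \log 2 / (c_0 \delta)$, we obtain that

$$\frac{1}{n}\|M_S\|_2^2 \ge \frac{1}{2}\|M\|_\Pi^2 - C_1 \beta \sqrt{\frac{d}{n}} \quad \text{for all } M \in \mathcal{C}(\beta, \delta) \tag{5.11}$$

with probability greater than $1 - 2\exp(-c_0 n\delta)$.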
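The series bound in the peeling argument is easy to check numerically. The following sketch is an illustration with hypothetical function names: it compares a long partial sum of $\sum_{\ell \ge 1} \exp(-n\alpha^\ell\delta/10)$ with the geometric closed form $q/(1-q)$, where $q = \exp(-c_0 n\delta)$ and $c_0 = \log(\alpha)/10$, treating the product $n\delta$ as a single input.

```python
import math


def peeling_tail_sum(n_delta, alpha=1.5, terms=200):
    """Partial sum of sum_{l >= 1} exp(-n * delta * alpha^l / 10); omitted terms are negligible."""
    return sum(math.exp(-n_delta * alpha ** l / 10.0) for l in range(1, terms + 1))


def geometric_bound(n_delta, alpha=1.5):
    """Closed form q / (1 - q) with q = exp(-c0 * n * delta) and c0 = log(alpha) / 10."""
    q = math.exp(-math.log(alpha) / 10.0 * n_delta)
    return q / (1.0 - q)
```

Once $q \le 1/2$, the closed form is at most $2q$, which is where the failure probability $2\exp(-c_0 n\delta)$ comes from.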



Step 3. Now we combine the results in Step 1 and Step 2 to finish the proof. On one hand,
it follows from (5.5) that for a sample size $2 < n \le d_1 d_2$,

$$\Big| \frac{1}{n}\sum_{t=1}^n \xi_t \hat\Delta_{i_t,j_t} \Big| \le \widehat{\mathcal{R}}_n(2\alpha, 2R) \le 8(R + \alpha)\sqrt{\frac{d}{n}}$$

holds with probability at least $1 - e^{-d}$. On the other hand, set $\tilde\Delta = \hat\Delta / (2\alpha)$, so that
$\|\tilde\Delta\|_\infty \le 1$ and $\|\tilde\Delta\|_{\max} \le R/\alpha := \beta$. Then for any $t > 0$, applying (5.11) with
$\delta = \log(2/t) / (c_0 n)$ yields that for a sample size $2 < n \le d_1 d_2$,

$$\|\tilde\Delta\|_\Pi^2 \le \max\Big\{ \delta, \ \frac{2}{n}\|\tilde\Delta_S\|_2^2 + 2\beta C_1 \sqrt{\frac{d}{n}} \Big\}$$

with probability at least $1 - t$. The above estimates, together with the basic inequality (5.1),
imply the final conclusion (3.5) after a simple rescaling. Similarly, using the upper bound
(5.6) instead of (5.5), together with the lower bound (5.11), gives (3.9) in the case of
sub-exponential noise. ∎
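The "simple rescaling" can be spelled out as below. This is a sketch under the assumptions that the upper bound holds in the form $8(R+\alpha)\sqrt{d/n}$, that $\tilde\Delta = \hat\Delta/(2\alpha)$, $\beta = R/\alpha$ and $\delta = \log(2/t)/(c_0 n)$, and that both events hold simultaneously, which by the union bound happens with probability at least $1 - e^{-d} - t$. The constants are not optimized and are not claimed to match those of (3.5).

```latex
\begin{aligned}
\frac{1}{n}\|\hat\Delta_S\|_2^2
  &\le 2\sigma \Big| \frac{1}{n}\sum_{t=1}^n \xi_t \hat\Delta_{i_t,j_t} \Big|
   \le 16\,\sigma (R + \alpha) \sqrt{\frac{d}{n}}
  && \text{by (5.1) and Step 1}, \\
\|\hat\Delta\|_\Pi^2 = 4\alpha^2 \|\tilde\Delta\|_\Pi^2
  &\le \max\Big\{ 4\alpha^2 \delta, \ \frac{2}{n}\|\hat\Delta_S\|_2^2 + 8 C_1 \alpha R \sqrt{\frac{d}{n}} \Big\}
  && \text{by Step 2, since } 4\alpha^2 \cdot 2\beta C_1 = 8 C_1 \alpha R, \\
  &\le \max\Big\{ \frac{4\alpha^2 \log(2/t)}{c_0\, n}, \ \big( 32\,\sigma (R + \alpha) + 8 C_1 \alpha R \big) \sqrt{\frac{d}{n}} \Big\}.
\end{aligned}
```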



5.1.1 Proof of Lemma 5.1 

Here we prove the concentration inequality given by Lemma 5.1. The argument is based on 
techniques of probability in Banach spaces, including symmetrization, contraction inequal- 
ity and Bousquet's version of Talagrand concentration inequality, and the upper bound 
(2.6) on the empirical Rademacher complexity of the max-norm ball. 

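Consider each matrix $M \in \mathbb{R}^{d_1 \times d_2}$ as a function $[d_1] \times [d_2] \to \mathbb{R}$, i.e. $M(k,l) = M_{kl}$,
and rewrite the empirical process of interest as follows:

$$\Delta_D(S) = \sup_{f_M : M \in \mathcal{B}(D)} \Big| \frac{1}{n}\sum_{t=1}^n f_M(i_t, j_t) - \mathbb{E}[f_M(i_t, j_t)] \Big|, \quad f_M(\cdot) = \{M(\cdot)\}^2.$$

Recall that $|M_{kl}| \le \|M\|_\infty \le 1$ for all pairs $(k,l)$ and

$$\sup_{M \in \mathcal{B}(D)} \mathrm{Var}[f_M(i_1, j_1)] \le \sup_{M \in \mathcal{B}(D)} \|M\|_\infty^2 \|M\|_\Pi^2 \le D.$$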



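We first bound $\mathbb{E}_{S \sim \Pi}[\Delta_D(S)]$, then show that $\Delta_D(S)$ is concentrated around its mean. Now
a standard symmetrization argument [25] yields

$$\mathbb{E}_{S \sim \Pi}[\Delta_D(S)] \le 2\, \mathbb{E}_{S \sim \Pi}\Big\{ \mathbb{E}_\varepsilon \Big[ \sup_{M \in \mathcal{B}(D)} \Big| \frac{1}{n}\sum_{t=1}^n \varepsilon_t M_{i_t,j_t}^2 \Big| \ \Big| \ S \Big] \Big\},$$

where $\{\varepsilon_t\}_{t=1}^n$ is an i.i.d. Rademacher sequence, independent of $S$. Given an index set
$S = \{(i_1,j_1), \ldots, (i_n,j_n)\}$, since $|M_{i_t,j_t}| \le 1$, the Ledoux-Talagrand contraction inequality
[25] implies (with $d = d_1 + d_2$)

$$\begin{aligned}
\mathbb{E}_\varepsilon \Big[ \sup_{M \in \mathcal{B}(D)} \Big| \frac{1}{n}\sum_{t=1}^n \varepsilon_t M_{i_t,j_t}^2 \Big| \ \Big| \ S \Big]
&\le 4\, \mathbb{E}_\varepsilon \Big[ \sup_{M \in \mathcal{B}(D)} \Big| \frac{1}{n}\sum_{t=1}^n \varepsilon_t M_{i_t,j_t} \Big| \ \Big| \ S \Big] \\
&\le 4\, \mathbb{E}_\varepsilon \Big[ \sup_{\|M\|_{\max} \le \beta} \Big| \frac{1}{n}\sum_{t=1}^n \varepsilon_t M_{i_t,j_t} \Big| \ \Big| \ S \Big]
 \le 48 \beta \sqrt{\frac{d}{n}},
\end{aligned}$$

where the last step used (2.6). Now that the "worst-case" Rademacher complexity is
uniformly bounded, we have

$$\mathbb{E}_{S \sim \Pi}[\Delta_D(S)] \le 96 \beta \sqrt{\frac{d}{n}}. \tag{5.12}$$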
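The contraction step, which trades the squared entries for the entries themselves at the cost of a factor 4, can be verified exactly on a small finite class. The sketch below is an illustration only: it replaces the matrix class by a handful of random vectors with entries in $[-1, 1]$ (playing the role of $(M_{i_1,j_1}, \ldots, M_{i_n,j_n})$) and computes both Rademacher averages by enumerating all $2^n$ sign patterns. The class, its size and the function names are assumptions of this example.

```python
import itertools
import random


def rademacher_average(vectors):
    """Exact E_eps sup_v |(1/n) sum_t eps_t v_t|, enumerating all 2^n sign patterns."""
    n = len(vectors[0])
    total = 0.0
    for signs in itertools.product((1, -1), repeat=n):
        total += max(abs(sum(s * x for s, x in zip(signs, v))) for v in vectors)
    return total / (n * 2 ** n)


def contraction_check(num_vectors=6, n=8, seed=0):
    """Return (average for squared entries, average for raw entries) on a random finite class."""
    rng = random.Random(seed)
    vectors = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(num_vectors)]
    squared = [[x * x for x in v] for v in vectors]
    return rademacher_average(squared), rademacher_average(vectors)
```

The factor 4 arises because $x \mapsto x^2/2$ is a contraction on $[-1, 1]$ vanishing at the origin, and the contraction principle with absolute values costs another factor 2.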



Next, Bousquet's version of Talagrand's concentration inequality for empirical processes
indexed by bounded functions implies that for all $t > 0$, with probability at least $1 - e^{-t}$,

$$\Delta_D(S) \le 2 \cdots$$

So our conclusion (5.10) follows by taking $t = nD/10$.



5.2 Proof of Theorem 3.2 



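By construction in Lemma 3.1, set $\delta = \alpha\gamma\sqrt{d_1 d_2 / 2}$ and we see that $\mathcal{M}$ is a $\delta$-packing set
of $\mathcal{K}(\alpha, R)$ in the Frobenius norm. Next, a standard argument (e.g. [46, 47]) yields a lower
bound on the $\|\cdot\|_F$-risk in terms of the error in a multi-way hypothesis testing problem.
More concretely,

$$\inf_{\hat M} \max_{M \in \mathcal{K}(\alpha,R)} \mathbb{E}\|\hat M - M\|_F^2 \ge \frac{\delta^2}{4} \min_{\tilde M} \mathbb{P}(\tilde M \ne M^*),$$

where the random variable $M^* \in \mathbb{R}^{d_1 \times d_2}$ is uniformly distributed over the packing set $\mathcal{M}$.
Conditional on $S = \{(i_1,j_1), \ldots, (i_n,j_n)\}$, a variant of Fano's inequality [11] gives the lower
bound

$$\mathbb{P}(\tilde M \ne M^* \mid S) \ge 1 - \frac{\binom{|\mathcal{M}|}{2}^{-1} \sum_{k \ne l} K(M^k \| M^l) + \log 2}{\log |\mathcal{M}|}, \tag{5.13}$$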

where $K(M^k \| M^l)$ is the Kullback-Leibler divergence between distributions $(Y_S \mid M^k)$ and
$(Y_S \mid M^l)$. For the observation model (3.1) with i.i.d. Gaussian noise, we have

$$K(M^k \| M^l) = \frac{1}{2\sigma^2}\sum_{t=1}^n \big( M^k_{i_t,j_t} - M^l_{i_t,j_t} \big)^2$$

and

$$\mathbb{E}_S[K(M^k \| M^l)] = \frac{n}{2\sigma^2}\|M^k - M^l\|_\Pi^2, \tag{5.14}$$

where $\|\cdot\|_\Pi$ denotes the weighted Frobenius norm with respect to $\Pi$, that is,

$$\|M\|_\Pi^2 = \sum_{k=1}^{d_1}\sum_{l=1}^{d_2} \pi_{kl} M_{kl}^2, \quad \text{for any } M \in \mathbb{R}^{d_1 \times d_2}.$$
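The identity between the expected Kullback-Leibler divergence and the weighted Frobenius norm can be checked by brute force on a toy example. The sketch below is illustrative and uses hypothetical names: it assumes Gaussian noise with standard deviation `sigma`, a sampling distribution `Pi` given as a matrix of cell probabilities, and $n$ i.i.d. draws, and it averages the conditional divergence over every possible index set.

```python
import itertools


def kl_given_sample(Mk, Ml, sample, sigma):
    """KL divergence between Gaussian observation vectors on a fixed index set S."""
    return sum((Mk[i][j] - Ml[i][j]) ** 2 for i, j in sample) / (2.0 * sigma ** 2)


def weighted_sq_norm(M, Pi):
    """Squared weighted Frobenius norm: sum_{k,l} pi_kl * M_kl^2."""
    return sum(Pi[k][l] * M[k][l] ** 2 for k in range(len(M)) for l in range(len(M[0])))


def expected_kl(Mk, Ml, Pi, n, sigma):
    """Closed form n / (2 sigma^2) * ||Mk - Ml||_Pi^2."""
    diff = [[a - b for a, b in zip(row_k, row_l)] for row_k, row_l in zip(Mk, Ml)]
    return n / (2.0 * sigma ** 2) * weighted_sq_norm(diff, Pi)


def expected_kl_by_enumeration(Mk, Ml, Pi, n, sigma):
    """Average kl_given_sample over all index sets of size n, weighted by their probability."""
    cells = [(k, l) for k in range(len(Pi)) for l in range(len(Pi[0]))]
    total = 0.0
    for sample in itertools.product(cells, repeat=n):
        prob = 1.0
        for k, l in sample:
            prob *= Pi[k][l]
        total += prob * kl_given_sample(Mk, Ml, sample, sigma)
    return total
```

Both functions agree up to floating-point error because the conditional divergence is a sum of $n$ identically distributed terms, each with mean $\|M^k - M^l\|_\Pi^2/(2\sigma^2)$.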



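For any two distinct $M^k, M^l \in \mathcal{M}$, $\|M^k - M^l\|_F^2 \le 4 d_1 d_2 \gamma^2 \alpha^2$, which together with (5.13),
(5.14) and the assumption $\max_{k,l}\{\pi_{kl}\} \le L/(d_1 d_2)$ implies

$$\begin{aligned}
\mathbb{P}(\tilde M \ne M^*)
&\ge 1 - \frac{\binom{|\mathcal{M}|}{2}^{-1} \sum_{k \ne l} \mathbb{E}_S[K(M^k \| M^l)] + \log 2}{\log |\mathcal{M}|} \\
&\ge 1 - \frac{\frac{32 L \alpha^2 \gamma^4 n}{\sigma^2} + 12\gamma^2}{r(d_1 \vee d_2)}
 \ge 1 - \frac{32 L \alpha^2 \gamma^4 n}{\sigma^2 r(d_1 \vee d_2)} - \frac{12}{r(d_1 \vee d_2)} \ge \frac{1}{2},
\end{aligned}$$

provided that $r(d_1 \vee d_2) \ge 48$ and $\gamma^4 \le \frac{\sigma^2 r(d_1 \vee d_2)}{128 L \alpha^2 n}$. If
$\frac{\sigma^2 r(d_1 \vee d_2)}{128 L \alpha^2 n} > 1$, then we choose $\gamma = 1$ so that

$$\inf_{\hat M} \max_{M \in \mathcal{K}(\alpha,R)} \frac{1}{d_1 d_2}\mathbb{E}\|\hat M - M\|_F^2 \ge \frac{\alpha^2}{16}.$$

Otherwise, as long as the parameters $(n, d_1, d_2, \alpha, R)$ satisfy (3.11), setting

$$\gamma^2 = \frac{\sigma}{8\sqrt{2}\,\alpha}\sqrt{\frac{r(d_1 \vee d_2)}{nL}}$$

yields

$$\inf_{\hat M} \max_{M \in \mathbb{B}_{\max}(R)} \frac{1}{d_1 d_2}\mathbb{E}\|\hat M - M\|_F^2 \ge \frac{\sigma\alpha}{128\sqrt{2}}\sqrt{\frac{r(d_1 \vee d_2)}{nL}} \ge \frac{\sigma R}{256}\sqrt{\frac{d}{nL}},$$

as desired.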
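The arithmetic behind the choice of $\gamma$ can be traced on concrete numbers. The following sketch is illustrative only: it assumes $r = (R/\alpha)^2$, a packing radius with $\delta^2 = \alpha^2\gamma^2 d_1 d_2/2$ (so that the normalized risk bound is $\alpha^2\gamma^2/16$ once the testing error is at least $1/2$), and the Fano-type bound $1 - (32L\alpha^2\gamma^4 n/\sigma^2 + 12\gamma^2)/(r(d_1 \vee d_2))$. The function name and the parameter values are invented for this example.

```python
import math


def fano_arithmetic(n, d1, d2, alpha, R, sigma, L):
    """Return (gamma^2, Fano lower bound on the testing error, risk bound, target rate)."""
    r = (R / alpha) ** 2
    d_max = max(d1, d2)
    # Choice of gamma^2 that makes gamma^4 equal to sigma^2 r d_max / (128 L alpha^2 n).
    gamma_sq = sigma / (8.0 * math.sqrt(2.0) * alpha) * math.sqrt(r * d_max / (n * L))
    fano = 1.0 - (32.0 * L * alpha ** 2 * gamma_sq ** 2 * n / sigma ** 2 + 12.0 * gamma_sq) / (r * d_max)
    risk = alpha ** 2 * gamma_sq / 16.0  # (delta^2 / 4) * (1 / 2) / (d1 * d2)
    target = sigma * R / 256.0 * math.sqrt((d1 + d2) / (n * L))
    return gamma_sq, fano, risk, target
```

For instance, `fano_arithmetic(100, 100, 200, 1.0, 2.0, 1.0, 1.0)` corresponds to $r = 4$ and gives $\gamma^2 = 1/4$; the testing-error bound stays above $1/2$ and the risk bound dominates the target rate because $d_1 \vee d_2 \ge d/2$.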



5.3 Proof of Lemma 3.1 

We proceed via the probabilistic method. Assume without loss of generality that $d_2 \ge d_1$.
Let $N = \exp\big(\frac{r d_2}{16\gamma^2}\big)$, $B = \frac{r}{\gamma^2}$, and for each $i = 1, \ldots, N$, draw a random matrix
$M^i \in \mathbb{R}^{d_1 \times d_2}$ as follows: the matrix $M^i$ consists of identical blocks of dimensions $B \times d_2$,
stacked from top to bottom, with the entries of the first block being i.i.d. symmetric random
variables taking values $\pm\alpha\gamma$, such that

$$M^i_{kl} := M^i_{k'l}, \quad k' = k \ (\mathrm{mod}\ B) + 1.$$
Next we show that the above random procedure succeeds in generating a set having all desired
properties, with non-zero probability. For $1 \le i \le N$, it is easy to see that

$$\|M^i\|_\infty = \alpha\gamma \le \alpha, \quad \frac{1}{d_1 d_2}\|M^i\|_F^2 = \alpha^2\gamma^2,$$

and since $\mathrm{rank}(M^i) \le B$,

$$\|M^i\|_{\max} \le \sqrt{B}\, \|M^i\|_\infty = \alpha\sqrt{r} = R.$$



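Thus $M^i \in \mathcal{K}(\alpha, R)$, and it remains to show that the set $\{M^i\}_{i=1}^N$ satisfies property (ii). In
fact, for any pair $1 \le i \ne j \le N$,

$$\|M^i - M^j\|_F^2 = \sum_{k,l} \big( M^i_{kl} - M^j_{kl} \big)^2 \ge \Big\lfloor \frac{d_1}{B} \Big\rfloor \sum_{k=1}^{B}\sum_{l=1}^{d_2} \big( M^i_{kl} - M^j_{kl} \big)^2 = 4\alpha^2\gamma^2 \Big\lfloor \frac{d_1}{B} \Big\rfloor \sum_{k=1}^{B}\sum_{l=1}^{d_2} \delta_{kl},$$

where $\delta_{kl}$ are independent 0/1 Bernoulli random variables with mean 1/2. Hoeffding's
inequality gives

$$\mathbb{P}\Big\{ \sum_{k=1}^{B}\sum_{l=1}^{d_2} \delta_{kl} \le \frac{d_2 B}{4} \Big\} \le \exp(-d_2 B / 8).$$

Since there are fewer than $N^2/2$ such index pairs in total, the above inequality, together with a
union bound, implies that with probability at least $1 - \frac{N^2}{2}\exp(-d_2 B / 8) \ge 1/2$,

$$\|M^i - M^j\|_F^2 > \alpha^2\gamma^2 \Big\lfloor \frac{d_1}{B} \Big\rfloor d_2 B \ge \frac{\alpha^2\gamma^2 d_1 d_2}{2} \quad \text{for all } i \ne j.$$

This completes the proof of Lemma 3.1.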
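The random construction can be simulated at small scale. The sketch below is a toy illustration, not the construction at the cardinality used in the proof: it draws only a handful of matrices (the parameter `N` is chosen by hand, far smaller than the exponential cardinality in the lemma), builds each one by repeating a random $B \times d_2$ sign block down the rows, and reports the smallest pairwise squared Frobenius distance. Function names and parameter values are assumptions of this example.

```python
import itertools

import numpy as np


def random_block_matrix(d1, d2, B, alpha, gamma, rng):
    """Rows k = 0, ..., d1 - 1 copy row (k mod B) of a random B x d2 block with entries +-alpha*gamma."""
    first_block = alpha * gamma * rng.choice([-1.0, 1.0], size=(B, d2))
    return first_block[np.arange(d1) % B, :]


def build_candidate_set(N, d1, d2, r, gamma, alpha, seed=0):
    """Draw N independent block matrices with block height B = r / gamma^2."""
    B = int(round(r / gamma ** 2))
    rng = np.random.default_rng(seed)
    return [random_block_matrix(d1, d2, B, alpha, gamma, rng) for _ in range(N)], B


def min_pairwise_sq_distance(mats):
    """Smallest squared Frobenius distance over all pairs of distinct matrices."""
    return min(float(np.sum((A - C) ** 2)) for A, C in itertools.combinations(mats, 2))
```

With `d1 = 8`, `d2 = 64`, `r = 1`, `gamma = 0.5` and `alpha = 2`, one gets `B = 4`, every matrix has rank at most 4 and entries $\pm 1$, and the minimum distance can be compared with the threshold $\alpha^2\gamma^2 d_1 d_2/2 = 256$ from property (ii).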



References 

[1] Argyriou, A., Evgeniou, T. and Pontil, M. (2008). Convex multi-task feature learning.
Mach. Learn. 73 243-272.

[2] Bartlett, P. and Mendelson, S. (2002). Rademacher and Gaussian complexities:
Risk bounds and structural results. J. Mach. Learn. Res. 3 463-482.

[3] Biswas, P., Lian, T.-C., Wang, T.-C. and Ye, Y. (2006). Semidefinite programming
based algorithms for sensor network localization. ACM Trans. Sen. Netw. 2 188-220.

[4] Burer, S. and Choi, C. (2006). Computational enhancements in low-rank semidefinite
programming. Optimization Methods and Software 21 493-512.






[5] Chistov, A.L. and Grigoriev, D.Y. (1984). Complexity of quantifier elimination in 
the theory of algebraically closed fields. In Proceedings of the 11th Symposium on Math- 
ematical Foundations of Computer Science, volume 176 of Lecture Notes in Computer 
Science, pages 17-31. Springer Verlag. 

[6] Candes, E. and Plan, Y. (2010). Matrix completion with noise. Proc. IEEE 98 
925-936. 

[7] Candes, E. and Plan, Y. (2011). Tight oracle bounds for low-rank matrix recovery 
from a minimal number of random measurements. IEEE Trans. Inform. Theory 57 
2342-2359. 

[8] Candes, E. and Recht, B. (2009). Exact matrix completion via convex optimization. 
Found. Comput. Math. 9 717-772. 

[9] Candes, E. and Tao, T. (2010). The power of convex relaxations: Near-optimal 
matrix completion. IEEE Trans. Inform. Theory 56 2053-2080. 

[10] Chen, P. and Suter, D. (2004). Recovering the missing components in a large noisy 
low-rank matrix: application to SFM. IEEE Trans. Pattern Anal. Mach. Intell. 26 
1051-1063. 

[11] Cover, T.M. and Thomas, J. A. (1991). Elements of Information Theory. John 
Wiley and Sons, New York. 

[12] Davenport, M.A., Plan, Y., van den Berg, E. and Wootters, M. (2012). 1-bit
matrix completion. arXiv:1209.3672.

[13] Foygel, R., Salakhutdinov, R., Shamir, O. and Srebro, N. (2011). Learning
with the weighted trace-norm under arbitrary sampling distributions. Advances in
Neural Information Processing Systems (NIPS), 24.

[14] Foygel, R. and Srebro, N. (2011). Concentration-based guarantees for low-rank
matrix reconstruction. 24th Annual Conference on Learning Theory (COLT).

[15] Goldberg, D., Nichols, D., Oki, B.M. and Terry, D. (1992). Using collaborative 
filtering to weave an information tapestry. Comm. ACM 61-70. 

[16] Green, P. and Wind, Y. (1973). Multivariate decisions in marketing: A measurement 
approach. Dryden, Hinsdale, IL. 

[17] Gross, D. (2011). Recovering low-rank matrices from few coefficients in any basis. 
IEEE Trans. Inform. Theory 57 1548-1566. 






[18] Jameson, G.J.O. (1987). Summing and Nuclear Norms in Banach Space Theory. 
Number 8 in London Mathematical Society Student Texts. Cambridge University 
Press, Cambridge, UK. 

[19] Julia, C., Sappa, A.D., Lumbreras, F., Serrat, J. and Lopez, A. (2011). Rank
estimation in missing data matrix problems. J. Math. Imaging Vis. 39 140-160.

[20] Keshavan, R., Montanari, A. and Oh, S. (2010). Matrix completion from noisy 
entries. J. Mach. Learn. Res. 11 2057-2078. 

[21] Klopp, O. (2012). Noisy low-rank matrix completion with general sampling distribu- 
tion. arXiv:1203.0108. 

[22] Klopp, O. (2011). Rank penalized estimators for high-dimensional matrices. Electron. 
J. Stat. 5 1161-1183. 

[23] Koltchinskii, V. (2011). Von Neumann entropy penalization and low-rank matrix
estimation. Ann. Statist. 39 2936-2973.

[24] Koltchinskii, V., Lounici, K. and Tsybakov, A.B. (2011). Nuclear norm penalization
and optimal rates for noisy low rank matrix completion. Ann. Statist. 39
2302-2329.

[25] Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry
and Processes. Springer-Verlag, New York, NY.

[26] Lee, J., Recht, B., Salakhutdinov, R., Srebro, N. and Tropp, J. (2010). Prac- 
tical large-scale optimization for max-norm regularization. Advances in Neural Infor- 
mation Processing Systems, 23. 

[27] Lin, Z., Ganesh, A., Wright, J., Wu, L., Chen, M. and Ma, Y. (2009). Fast
convex optimization algorithms for exact recovery of a corrupted low-rank matrix. In
Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP).

[28] Linial, N., Mendelson, S., Schechtman, G. and Shraibman, A. (2004). Complexity
measures of sign matrices. Combinatorica 27 439-463.

[29] Liu, Z. and Vandenberghe, L. (2009). Interior-point method for nuclear norm ap- 
proximation with application to system identification. SIAM J. Matrix Analysis and 
Applications, 31 1235-1256. 

[30] Mazumder, R., Hastie, T. and Tibshirani, R. (2010). Spectral regularization
algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11 2287-2322.






[31] Negahban, S. and Wainwright, M.J. (2011). Estimation of (near) low-rank matri- 
ces with noise and high-dimensional scaling. Ann. Statist. 39 1069-1097. 

[32] Negahban, S. and Wainwright, M.J. (2012). Restricted strong convexity and 
weighted matrix completion: Optimal bounds with noise. J. Mach. Learn. Res. 13 
1665-1697. 

[33] Nesterov, Y. (2007). Gradient methods for minimizing composite objective function. 
Technical Report - CORE - Universite Catholique de Louvain. 

[34] Pisier, G. (1989). The volume of convex bodies and Banach space geometry.
Cambridge University Press, Cambridge.

[35] Recht, B. (2011). A simpler approach to matrix completion. J. Mach. Learn. Res. 
12 3413-3430. 

[36] Rohde, A. and Tsybakov, A.B. (2011). Estimation of high-dimensional low-rank
matrices. Ann. Statist. 39 887-930.

[37] Salakhutdinov, R. and Srebro, N. (2010). Collaborative filtering in a non-uniform 
world: Learning with the weighted trace norm. Advances in Neural Information Pro- 
cessing Systems (NIPS), 23. 

[38] Singer, A. (2008). A remark on global positioning from local distances. Proc. Natl. 
Acad. Sci. 105 9507-9511. 

[39] Singer, A. and Cucuringu, M. (2010). Uniqueness of low-rank matrix completion
by rigidity theory. SIAM J. Matrix Analysis and Applications, 31 1621-1641.

[40] Srebro, N., Rennie, J. and Jaakkola, T. (2004). Maximum-margin matrix factorization.
In Advances in Neural Information Processing Systems 17 (L. Saul, Y. Weiss
and L. Bottou, eds.) 1329-1336. MIT Press, Cambridge, MA.

[41] Srebro, N. and Shraibman, A. (2005). Rank, trace-norm and max-norm. In Learn- 
ing Theory, Proceedings of COLT-2005. Lecture Notes in Comput. Sci. 3559 545-560. 
Springer, Berlin. 

[42] Srebro, N., Sridharan, K. and Tewari, A. (2010). Optimistic Rates for Learning 
with a Smooth Loss. arXiv:1009.3896v2. 

[43] Tomasi, C. and Kanade, T. (1992). Shape and motion from image streams under
orthography: a factorization method. Int. J. Comput. Vis. 9 137-154.






[44] Tropp, J.A. (2012). User-friendly tail bounds for sums of random matrices. Found.
Comput. Math. 12 389-434.

[45] Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random ma- 
trices. In Compressed Sensing: Theory and Applications, Y. Eldar and G. Kutyniok, 
Eds. Cambridge University Press. 

[46] Yang, Y. and Barron, A. (1999). Information-theoretic determination of minimax 
rates of convergence. Ann. Statist. 27 1564-1599. 

[47] Yu, B. (1997). Assouad, Fano, and Le Cam. Festschrift for Lucien Le Cam. D. Pollard,
E. Torgersen, and G. Yang (eds), 423-435, Springer-Verlag.






